{
 "cells": [
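  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Why does Bagging improve model performance?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bagging (bootstrap aggregating) is an ensemble method that trains many base learners and combines them to get better performance than any single one. Each base learner is trained on a different bootstrap sample, drawn from the training set at random with replacement, so each one sees a slightly different view of the data. The learners can be trained in parallel, and their predictions are combined by majority vote for classification or by averaging for regression. Adding more learners generally helps, but with diminishing returns: the error levels off instead of improving indefinitely. Bagging is most useful for complex, low-bias base learners such as deep decision trees, whose main weakness is high variance. Averaging n such models reduces the variance to 1/n of a single model's variance if the models are independent; in practice the bootstrap samples overlap, the models are correlated, and the reduction is smaller."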
   ]
  },
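  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variance argument can be made concrete with a short derivation. As a sketch, assume (these symbols are introduced here for illustration) that the $n$ base learners $f_1,\\dots,f_n$ each have variance $\\sigma^2$ at a point $x$ and that every pair of learners has the same correlation $\\rho$. Then the averaged prediction has variance\n",
    "\n",
    "$$\\operatorname{Var}\\left(\\frac{1}{n}\\sum_{i=1}^{n} f_i(x)\\right) = \\frac{1}{n^2}\\left[n\\sigma^2 + n(n-1)\\rho\\sigma^2\\right] = \\rho\\sigma^2 + \\frac{1-\\rho}{n}\\sigma^2 .$$\n",
    "\n",
    "With independent learners ($\\rho = 0$) this is exactly $\\sigma^2/n$. With $\\rho > 0$ the second term still shrinks as $n$ grows, but the variance cannot drop below $\\rho\\sigma^2$, which is why decorrelating the learners matters as much as adding more of them."
   ]
  },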
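  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Where does the randomness in a random forest come from?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A random forest is an ensemble of decision trees built on the Bagging idea. Its randomness has two sources: each tree is trained on a bootstrap sample drawn at random from the full training set (random samples), and at each node the split is chosen from a randomly selected subset of the features (random candidate features). After training, the predictions of the trees are combined by majority vote."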
   ]
  },
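  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Random forests and GBDT both use decision trees as base learners. How do the trees in the two models differ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both are ensemble methods, but they differ fundamentally:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random forest is an implementation of Bagging, whereas GBDT is an implementation of Boosting."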
   ]
  },
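  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A random forest can train its trees in parallel, because each tree is trained independently of the others. GBDT has to train its trees sequentially, because each new tree improves on the ensemble of all the trees built before it."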
   ]
  },
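  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each tree in a random forest is a CART tree (a classification tree or a regression tree, depending on the task) trained on a bootstrap sample that contains about 63.2% of the distinct training examples. GBDT builds its trees by gradient boosting: it is an iterative algorithm in which every tree is a regression tree, even for classification tasks. Each tree fits the residual left by the sum of all previous trees (in general, the negative gradient of the loss), that is, the amount that must be added to the current prediction to reach the true value. Training therefore always builds on the models already trained."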
   ]
  },
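  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two of the statements above can be checked with short calculations; the notation ($n$, $F_m$, $h_m$, $r_i$) is introduced here for illustration. For a bootstrap sample of size $n$ drawn with replacement from $n$ examples, the probability that a given example is never drawn is\n",
    "\n",
    "$$\\left(1-\\frac{1}{n}\\right)^{n} \\to e^{-1} \\approx 0.368 \\quad \\text{as } n \\to \\infty,$$\n",
    "\n",
    "so each tree sees on average a fraction of about $1 - e^{-1} \\approx 0.632$ of the distinct examples.\n",
    "\n",
    "For GBDT with the squared loss $L(y, F) = \\tfrac{1}{2}(y - F)^2$, the negative gradient is the ordinary residual:\n",
    "\n",
    "$$r_i = -\\frac{\\partial L(y_i, F)}{\\partial F}\\Big|_{F = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i), \\qquad F_m(x) = F_{m-1}(x) + h_m(x),$$\n",
    "\n",
    "where the regression tree $h_m$ is fitted to the $r_i$ (a learning-rate factor on $h_m$ is omitted for brevity). For other losses the tree fits the negative gradient instead, which is a real-valued target and explains why the base learners are regression trees even in classification."
   ]
  },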
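  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Briefly explain why LightGBM trains faster than XGBoost."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "XGBoost's exact greedy algorithm pre-sorts the feature values before training and then scans every candidate split point to find the best one, which is slow on large datasets. LightGBM uses a histogram algorithm that buckets feature values into discrete bins: it needs less memory and finding a split costs less, at the price of not always finding the exact optimal split point."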
   ]
  },
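  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LightGBM grows trees leaf-wise, always splitting the leaf with the largest gain, rather than level-wise. This avoids spending computation on the many low-gain splits that level-wise growth in XGBoost can produce."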
   ]
  },
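  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LightGBM uses histogram subtraction: the histogram of a node is obtained by subtracting its sibling's histogram from the parent's, so only one of the two children has to be built from the data. This can roughly double the speed of histogram construction."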
   ]
  },
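  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GOSS (Gradient-based One-Side Sampling): LightGBM downsamples the training examples according to their gradients, keeping the examples with large gradients and randomly sampling those with small gradients. Fewer examples means faster computation."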
   ]
  },
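  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "EFB (Exclusive Feature Bundling): LightGBM bundles mutually exclusive features together, which lowers the feature dimension and speeds up computation."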
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LightGBM natively supports feature-parallel and data-parallel learning, which also speeds up training."
   ]
  },
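  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The histogram-subtraction point follows directly from how the histograms are defined. As a sketch (the symbols are introduced here for illustration), let $G_P(b)$ be the sum of the gradients of the examples in a parent node $P$ that fall into bin $b$ of a feature, and let $L$ and $R$ be its two children. Every example in $P$ goes to exactly one child, so\n",
    "\n",
    "$$G_P(b) = G_L(b) + G_R(b) \\quad\\Longrightarrow\\quad G_R(b) = G_P(b) - G_L(b),$$\n",
    "\n",
    "and the same identity holds for the second-order (Hessian) sums. Building a histogram from data takes time proportional to the number of examples in the node, whereas the subtraction takes time proportional to the number of bins. If the histogram is built only for the smaller child and the larger one is obtained by subtraction, at most half of the parent's examples have to be scanned."
   ]
  },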
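  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Using tf-idf features, train LightGBM (gbdt) for Otto product classification, tune the hyperparameters as close to optimal as possible, run the model on the test data, submit the predictions to Kaggle, and report the ranking."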
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt  # plotting library\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the Otto data and process the features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0,0.5,'number')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAELCAYAAAARNxsIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAG2tJREFUeJzt3XuUXnV97/H3h0RQqpBgBo25OFEHakCkMGIUEQEbEk5L6Cq0QWtyMDWrGMC7wqFHWGBaqHpQWoQTTSRYFyHiJfEYDSkXUUuAAUIuBJoxIBmCZDAhUCnQwPf8sX8jO8MzmSfD75n9DPm81nrW7P3dv71/3z2ZzHd++6qIwMzMLIe9qk7AzMxeOVxUzMwsGxcVMzPLxkXFzMyycVExM7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2yGV53AYBs1alS0trZWnYaZ2ZBy1113PR4RLf212+OKSmtrKx0dHVWnYWY2pEj6TT3tfPjLzMyycVExM7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2xcVMzMLJuGFRVJCyRtkbS2V/xsSQ9IWifpn0rx8yR1pmUnluJTUqxT0rml+ARJt0vaIOk6SXs3al/MzKw+jRypXA1MKQckHQdMAw6LiEOAr6T4RGA6cEha5xuShkkaBlwBTAUmAqentgCXApdFRBuwDZjVwH0xM7M6NOyO+oi4VVJrr/CZwCUR8WxqsyXFpwGLUvxBSZ3AUWlZZ0RsBJC0CJgmaT1wPPCh1GYhcCFwZWP2ZnA9fNE7Br3P8V9cM+h9mtkrz2CfUzkIOCYdtvq5pHel+BhgU6ldV4r1FX898ERE7OgVr0nSbEkdkjq6u7sz7YqZmfU22EVlODASmAR8DlgsSYBqtI0BxGuKiHkR0R4R7S0t/T4PzczMBmiwHyjZBfwgIgK4Q9ILwKgUH1dqNxbYnKZrxR8HRkgankYr5fZmZlaRwR6p/IjiXAiSDgL2pigQS4HpkvaRNAFoA+4A7gTa0pVee1OczF+aitLNwKlpuzOBJYO6J2Zm9hING6lIuhb4ADBKUhdwAbAAWJAuM34OmJkKxDpJi4H7gB3AnIh4Pm3nLGA5MAxYEBHrUhdfABZJ+hJwDzC/UftiZmb1aeTVX6f3sehv+mg/F5hbI74MWFYjvpEXrxAzM7Mm4DvqzcwsGxcVMzPLxkXFzMyycVExM7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2xcVMzMLBsXFTMzy8ZFxczMsnFRMTOzbFxUzMwsGxcVMzPLxkXFzMyycVExM7NsGlZUJC2QtCW95bH3ss9KCkmj0rwkXS6pU9JqSUeU2s6UtCF9ZpbiR0pak9a5XJIatS9mZlafRo5Urgam9A5KGgf8KfBwKTyV4r30bcBs4MrU9gCK1xC/m+ItjxdIGpnWuTK17VnvJX2ZmdngauTrhG+V1Fpj0WXA54Elpdg04Jr0vvqVkkZIGk3xjvsVEbEVQNIKYIqkW4D9IuK2FL8GOAX4aWP2xmzomvs3p1bS7/n/en0l/Vq1BvWciqSTgUci4t5ei8YAm0rzXSm2q3hXjbiZmVWoYSOV3iTtC5wPTK61uEYsBhDvq+/ZFIfKGD9+fL+5mpnZwAzmSOWtwATgXkkPAWOBuyW9kWKkMa7UdiywuZ/42BrxmiJiXkS0R0R7S0tLhl0xM7NaBq2oRMSaiDgwIlojopWiMBwREb8FlgIz0lVgk4DtEfEosByYLGlkOkE/GVielj0laVK66msGO5+jMTOzCjTykuJrgduAgyV1SZq1i+bLgI1AJ/BN4OMA6QT9xcCd6XNRz0l74EzgW2mdX+OT9GZmlWvk1V+n97O8tTQdwJw+2i0AFtSIdwCHvrwszcwsJ99Rb2Zm2biomJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYu
KmZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWjYuKmZll46JiZmbZNPJ1wgskbZG0thT7sqT7Ja2W9ENJI0rLzpPUKekBSSeW4lNSrFPSuaX4BEm3S9og6TpJezdqX8zMrD6NHKlcDUzpFVsBHBoRhwH/AZwHIGkiMB04JK3zDUnDJA0DrgCmAhOB01NbgEuByyKiDdgGzGrgvpiZWR0aVlQi4lZga6/YDRGxI82uBMam6WnAooh4NiIeBDqBo9KnMyI2RsRzwCJgmiQBxwPXp/UXAqc0al/MzKw+VZ5T+Sjw0zQ9BthUWtaVYn3FXw88USpQPXEzM6tQJUVF0vnADuC7PaEazWIA8b76my2pQ1JHd3f37qZrZmZ1GvSiImkm8GfAhyOipxB0AeNKzcYCm3cRfxwYIWl4r3hNETEvItojor2lpSXPjpiZ2UsMalGRNAX4AnByRDxdWrQUmC5pH0kTgDbgDuBOoC1d6bU3xcn8pakY3QycmtafCSwZrP0wM7PaGnlJ8bXAbcDBkrokzQL+BXgdsELSKklXAUTEOmAxcB/wM2BORDyfzpmcBSwH1gOLU1soitOnJXVSnGOZ36h9MTOz+gzvv8nARMTpNcJ9/uKPiLnA3BrxZcCyGvGNFFeHmZlZk/Ad9WZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYuKmZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWTSNfJ7xA0hZJa0uxAyStkLQhfR2Z4pJ0uaROSaslHVFaZ2Zqv0HSzFL8SElr0jqXS1Kj9sXMzOrTyJHK1cCUXrFzgRsjog24Mc0DTAXa0mc2cCUURQi4AHg3xauDL+gpRKnN7NJ6vfsyM7NB1rCiEhG3Alt7hacBC9P0QuCUUvyaKKwERkgaDZwIrIiIrRGxDVgBTEnL9ouI2yIigGtK2zIzs4oM9jmVN0TEowDp64EpPgbYVGrXlWK7infViNckabakDkkd3d3dL3snzMystmY5UV/rfEgMIF5TRMyLiPaIaG9paRlgimZm1p/BLiqPpUNXpK9bUrwLGFdqNxbY3E98bI24mZlVqN+iImmYpH/L1N9SoOcKrpnAklJ8RroKbBKwPR0eWw5MljQynaCfDCxPy56SNCld9TWjtC0zM6vI8P4aRMTzkp6WtH9EbK93w5KuBT4AjJLURXEV1yXAYkmzgIeB01LzZcBJQCfwNHBG6nurpIuBO1O7iyKi5+T/mRRXmL0G+Gn6mJlZhfotKskzwBpJK4Df9wQj4py+VoiI0/tYdEKNtgHM6WM7C4AFNeIdwKG7TtvMzAZTvUXlJ+ljZmbWp7qKSkQslPQaYHxEPNDgnMzMbIiq6+ovSX8OrAJ+luYPl7S0kYmZmdnQU+8lxRdSPCblCYCIWAVMaFBOZmY2RNVbVHbUuPKrz5sNzcxsz1Tvifq1kj4EDJPUBpwD/Hvj0jIzs6Go3pHK2cAhwLPAtcCTwCcblZSZmQ1N9V799TRwvqRLi9l4qrFpmZnZUFTv1V/vkrQGWE1xE+S9ko5sbGpmZjbU1HtOZT7w8Yj4BYCk9wHfBg5rVGJmZjb01HtO5ameggIQEb8EfAjMzMx2ssuRSuld8XdI+r8UJ+kD+GvglsamZmZmQ01/h7++2mv+gtK071MxM7Od7LKoRMRxg5WImZkNfXWdqJc0guJFWK3ldXb16Hszs6Hmwgsv3KP6bYR6r/5aBqwE1gAvNC4dMzMbyuotKq+OiE/n6lTSp4C/pTgvs4biTY+jgUXAAcDdwEci4jlJ+wDXAEcCvwP+OiIeSts5D5gFPA+cExHLc+VoZma7r95Lir8j6WOSRks6oOczkA4ljaF4dlh7RBwKDAOmA5cCl0VEG7CNoliQvm6LiLcBl6V2SJqY1jsEmAJ8Q9KwgeRkZmZ51FtUngO+DNwG3JU+HS+j3+HAayQN
B/YFHgWOB65PyxcCp6TpaWmetPwESUrxRRHxbEQ8SPF++6NeRk5mZvYy1Xv469PA2yLi8ZfbYUQ8IukrwMPAfwE3UBSpJyJiR2rWBYxJ02OATWndHZK2A69P8ZWlTZfXMTOzCtRbVNYBT+foUNJIilHGBIqXfn0PmFqjac99MOpjWV/xWn3OBmYDjB8/fjczNoCj//noSvr91dm/qqRfMxuYeovK88AqSTdTPP4eGPAlxR8EHoyIbgBJPwDeC4yQNDyNVsYCm1P7LmAc0JUOl+0PbC3Fe5TX2UlEzAPmAbS3t/umTTOzBqn3nMqPgLkUL+a6q/QZiIeBSZL2TedGTgDuA24GTk1tZgJL0vTSNE9aflNERIpPl7SPpAlAG3DHAHMyM7MM6n2fysL+W9UnIm6XdD3FZcM7gHsoRhE/ARZJ+lKKzU+rzKe4+qyTYoQyPW1nnaTFFAVpBzAnIp7PlaeZme2+eu+of5Aa5ysi4i0D6TQiLmDn54gBbKTG1VsR8QxwWh/bmUsxgjIzsyZQ7zmV9tL0qyl+yQ/oPhUzM3vlquucSkT8rvR5JCK+RnFfiZmZ2R/Ue/jriNLsXhQjl9c1JCMzMxuy6j389VVePKeyA3iIPs5zmJnZnqveojIV+Et2fvT9dOCiBuRkZmZDVL1F5UcUd7/fDTzTuHTMzGwoq7eojI2IKQ3NxMzMhrx676j/d0nvaGgmZmY25NU7Unkf8D/TTZDPUjzMMSLisIZlZmZmQ87unKg3MzPbpXqf/fWbRidiZmZDX73nVMzMzPrlomJmZtm4qJiZWTYuKmZmlo2LipmZZeOiYmZm2VRSVCSNkHS9pPslrZf0HkkHSFohaUP6OjK1laTLJXVKWl1+DL+kman9Bkkz++7RzMwGQ1Ujla8DP4uIPwbeCawHzgVujIg24MY0D8WNl23pMxu4EkDSARSvJH43xWuIL+gpRGZmVo1BLyqS9gPeD8wHiIjnIuIJYBqwMDVbCJySpqcB10RhJTBC0mjgRGBFRGyNiG3ACsAPvTQzq1AVI5W3AN3AtyXdI+lbkv4IeENEPAqQvh6Y2o8BNpXW70qxvuJmZlaRKorKcOAI4MqI+BPg97x4qKsW1YjFLuIv3YA0W1KHpI7u7u7dzdfMzOpURVHpAroi4vY0fz1FkXksHdYifd1Saj+utP5YYPMu4i8REfMioj0i2ltaWrLtiJmZ7WzQi0pE/BbYJOngFDoBuA9YCvRcwTUTWJKmlwIz0lVgk4Dt6fDYcmCypJHpBP3kFDMzs4rU++j73M4Gvitpb2AjcAZFgVssaRbwMHBaarsMOAnoBJ5ObYmIrZIuBu5M7S6KiK2DtwtmZtZbJUUlIlYB7TUWnVCjbQBz+tjOAmBB3uzMzGygfEe9mZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYuKmZmlk1Vd9SbmVkdFn/vqEr6/avT7hjQeh6pmJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtlUVlQkDZN0j6T/l+YnSLpd0gZJ16VXDSNpnzTfmZa3lrZxXoo/IOnEavbEzMx6VDlS+QSwvjR/KXBZRLQB24BZKT4L2BYRbwMuS+2QNBGYDhwCTAG+IWnYIOVuZmY1VFJUJI0F/gfwrTQv4Hjg+tRkIXBKmp6W5knLT0jtpwGLIuLZiHgQ6ASqeZ6BmZkB1Y1UvgZ8Hnghzb8eeCIidqT5LmBMmh4DbAJIy7en9n+I11jHzMwqMOhFRdKfAVsi4q5yuEbT6GfZrtbp3edsSR2SOrq7u3crXzMzq18VI5WjgZMlPQQsojjs9TVghKSepyaPBTan6S5gHEBavj+wtRyvsc5OImJeRLRHRHtLS0vevTEzsz8Y9KISEedFxNiIaKU40X5TRHwYuBk4NTWbCSxJ00vTPGn5TRERKT49XR02AWgDBvasZjMzy6KZ3qfyBWCRpC8B9wDz
U3w+8B1JnRQjlOkAEbFO0mLgPmAHMCcinh/8tM3MrEelRSUibgFuSdMbqXH1VkQ8A5zWx/pzgbmNy9DMzHaH76g3M7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2xcVMzMLBsXFTMzy8ZFxczMsnFRMTOzbJrpMS1mtgdZP/emQe/z7ecfP+h97mk8UjEzs2w8UrEh6+fvP7aSfo+99ed9LvuXz/x4EDN50Vlf/fNK+jXrzSMVMzPLxkXFzMyycVExM7NsXFTMzCybQS8qksZJulnSeknrJH0ixQ+QtELShvR1ZIpL0uWSOiWtlnREaVszU/sNkmb21aeZmQ2OKkYqO4DPRMTbgUnAHEkTgXOBGyOiDbgxzQNMpXj/fBswG7gSiiIEXAC8m+KNkRf0FCIzM6vGoBeViHg0Iu5O008B64ExwDRgYWq2EDglTU8DronCSmCEpNHAicCKiNgaEduAFcCUQdwVMzPrpdJzKpJagT8BbgfeEBGPQlF4gANTszHAptJqXSnWV9zMzCpSWVGR9Frg+8AnI+LJXTWtEYtdxGv1NVtSh6SO7u7u3U/WzMzqUskd9ZJeRVFQvhsRP0jhxySNjohH0+GtLSneBYwrrT4W2JziH+gVv6VWfxExD5gH0N7e/ofCc+TnrnnZ+zIQd315RiX9mpk1WhVXfwmYD6yPiP9TWrQU6LmCayawpBSfka4CmwRsT4fHlgOTJY1MJ+gnp5iZmVWkipHK0cBHgDWSVqXY/wIuARZLmgU8DJyWli0DTgI6gaeBMwAiYquki4E7U7uLImLr4OyCmZnVMuhFJSJ+Se3zIQAn1GgfwJw+trUAWJAvOzMzezl8R72ZmWXjomJmZtm4qJiZWTYuKmZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYuKmZmlo2LipmZZTPki4qkKZIekNQp6dyq8zEz25MN6aIiaRhwBTAVmAicLmlitVmZme25hnRRAY4COiNiY0Q8BywCplWck5nZHmuoF5UxwKbSfFeKmZlZBRQRVecwYJJOA06MiL9N8x8BjoqIs3u1mw3MTrMHAw9k6H4U8HiG7eTWjHk5p/o4p/o1Y16v9JzeHBEt/TUanqmzqnQB40rzY4HNvRtFxDxgXs6OJXVERHvObebQjHk5p/o4p/o1Y17OqTDUD3/dCbRJmiBpb2A6sLTinMzM9lhDeqQSETsknQUsB4YBCyJiXcVpmZntsYZ0UQGIiGXAsgq6zno4LaNmzMs51cc51a8Z83JODPET9WZm1lyG+jkVMzNrIi4qZmaWzR5ZVCS9UdIiSb+WdJ+kZZIOkrS2wf2eJmmdpBcktfdaVlVOX5Z0v6TVkn4oaUQT5HRxymeVpBskvanX8kryKvX/WUkhaVTVOUm6UNIj6Xu1StJJVeeU+j47PZNvnaR/qjonSdeVvkcPSVrVBDkdLmllyqlD0lG9lleV1zsl3SZpjaQfS9pvtzYQEXvUBxBwG/B3pdjhwDHA2gb3/XaKmy9vAdqbJKfJwPA0fSlwaRPktF9p+hzgqmb4XqW+xlFcbfgbYFTVOQEXAp+tEa8yp+OAfwP2SfMHVp1Tr/y+Cnyx6pyAG4Cpafok4JYm+fe7Ezg2TX8UuHh31t8TRyrHAf8dEVf1BCJiFaXHvUhqlfQLSXenz3tTfLSkW9NfFmslHSNpmKSr0/waSZ/qq+OIWB8Rte7mrzKnGyJiR5pdSXEDadU5PVma/SOgfDVJZXkllwGfb7KcaqkypzOBSyLi2dTvlibIqWf7Av4KuLYJcgqgZxSwPzvfuF1lXgcDt6bpFcBf7qLtSwz5S4oH4FDgrn7abAH+NCKekdRG8QPYDnwIWB4Rc1U8IXlfir8exkTEoQAqHT4agjl9FLiuGXKSNBeYAWyn+A/Wo7K8JJ0MPBIR9xa/m6rPKTlL0gygA/hMRGyrOKeDgGPS
v+EzFCOpOyvOqccxwGMRsSHNV5nTJ4Hlkr5CcSrivaVlVea1FjgZWAKcxs5PLenXnjhSqcergG9KWgN8j+Kx+lAMC8+QdCHwjoh4CtgIvEXSP0uaAjxZa4PNnpOk84EdwHebIaeIOD8ixqV8ztqNnBqSl6R9gfOBL+5mLg3LKbkSeCvFL41HKQ7tVJ3TcGAkMAn4HLBYvapwBTn1OJ0XRyn1alROZwKfSj/nnwLmN0leHwXmSLoLeB3w3G5l1chjc834AU4Abq0RbyUdq6Q4Tt3z18NwYEep3ZuAjwFrgBkp9lqKIeKPKe7q7y+HW9j5nEqlOQEzKY7f7tssOZW282ZKx5Crygt4B8Vfhg+lzw7gYeCNTfS9KvdXWU7Az4APlOZ/DbRU/X1K23sMGFv1z1Nqt50X7xUU8GQz5NWrv4OAO+pp2/PZE0cqNwH7SPpYT0DSuyh+efXYH3g0Il4APkLxCBgkvRnYEhHfpPir4ggVVwDtFRHfB/43cMRQyin91fIF4OSIeLpJcmorzZ4M3F91XhGxJiIOjIjWiGileJjpERHx24q/V6NLs39Bceiisu9T8iPg+LStg4C9KZ6UW/X/vQ8C90dEVylWZU6bgWPT9PHAhtKyKn+mDkxf9wL+Hriqr7Y17U4FeqV8KKr4Yoq/oNYBPwHaePEvgDZgNcWJ638E/jPFZ1L8p70H+AUwAXgncDewKn2m7qLfv6D4ZfQsxV9My5sgp06Kk389ba9qgpy+n9ZfTfFX1Zhm+PfrlcNDpKu/Kv5efYfir9HVFA9THd0EOe0N/Gvaxt3A8VXnlLZxNaWrqarOCXgfxXmTe4HbgSObJK9PAP+RPpeQRlP1fvyYFjMzy2ZPPPxlZmYNsideUtxwkq4Aju4V/npEfLuKfMA57Y5mzMs51cc51a9Refnwl5mZZePDX2Zmlo2LipmZZeOiYpaRpBGSPl51HmZVcVExy2sEUHdRUeFl/T+U5AturGn4h9Esr0uAt6p4X8fNwGEUz8F6FfD3EbFEUivw07T8PcApkj5I8WSDzRR3Vj8bEWdJaqG4o3l82v4nI+JX6blOb6J4bMfjFA8RNKuci4pZXucCh0bE4WkEsW9EPJkek7FS0tLU7mDgjIj4uIqXkPU8OuMpikd03JvafR24LCJ+KWk8xbtc3p6WHQm8LyL+a3B2zax/LipmjSPgHyS9H3gBGAO8IS37TUSsTNNHAT+PiK0Akr5H8SA/KJ5XNbH0kN/9JL0uTS91QbFm46Ji1jgfpng675ER8d+SHgJenZb9vtRuV4+F3wt4T+/ikYrM72uuYVYhn6g3y+spindQQPEU2S2poBzHzk+YLbsDOFbSyHTIrPymvRsovU9G0uENyNksG49UzDKKiN9J+pWktRQvS/pjSR0UT4a9v491HpH0DxRPqt0M3Efxrg2Ac4ArJK2m+P96K/B3Dd4NswHzY1rMmoCk10bEf6aRyg8pXqL0w6rzMttdPvxl1hwuTJchrwUepHjRldmQ45GKmZll45GKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtn8f6hs7Hva5FFpAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_data=pd.read_csv(\"D:\\\\python\\\\csdn data\\\\otto\\\\otto_train.csv\")\n",
    "train_target=train_data[\"target\"]\n",
    "sns.countplot(x=train_target)  # pass the labels by keyword so newer seaborn versions accept them\n",
    "plt.xlabel(\"target\")\n",
    "plt.ylabel(\"number\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use tf-idf to handle the long-tailed feature distributions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# separate the features from the class labels\n",
    "X=train_data.iloc[:,1:-1]\n",
    "y=train_data.iloc[:,-1]\n",
    "columns_org = X.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature values are very unevenly distributed and show a long tail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply a log transform to the features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_log=np.log1p(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply tf-idf to the samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform counts to TFIDF features\n",
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "tfidf = TfidfTransformer()\n",
    "\n",
    "# fit_transform returns a sparse matrix; convert it to a dense array\n",
    "X_tfidf = tfidf.fit_transform(X).toarray()\n",
    "# rebuild a DataFrame so the features can be inspected and plotted\n",
    "feat_names = columns_org + \"_tfidf\"\n",
    "X_tfidf = pd.DataFrame(columns = feat_names, data = X_tfidf)"
   ]
  },
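  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what `TfidfTransformer` does to count features, the sketch below applies it to a small made-up count matrix. The values and the names `toy_counts`, `toy_tfidf`, `toy_weights` and `toy_row_norms` are illustrative only and are not part of the Otto data. With scikit-learn's defaults (`use_idf=True`, `smooth_idf=True`, `norm='l2'`), each count is multiplied by $\\text{idf}(t) = \\ln\\frac{1+n}{1+\\text{df}(t)} + 1$, where $n$ is the number of samples and $\\text{df}(t)$ is the number of samples in which feature $t$ is non-zero, and each row is then scaled to unit L2 length."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "\n",
    "# Hypothetical toy data: 3 samples x 4 count features (not taken from Otto).\n",
    "toy_counts = np.array([[3, 0, 1, 0],\n",
    "                       [2, 0, 0, 0],\n",
    "                       [3, 1, 0, 0]])\n",
    "\n",
    "# Separate transformer so the notebook's fitted `tfidf` object is left untouched.\n",
    "toy_tfidf = TfidfTransformer()\n",
    "toy_weights = toy_tfidf.fit_transform(toy_counts).toarray()\n",
    "\n",
    "# Feature 0 is non-zero in every sample, so it gets the smallest idf weight.\n",
    "print(toy_tfidf.idf_)\n",
    "\n",
    "# With the default L2 norm, every non-empty row has unit length.\n",
    "toy_row_norms = np.linalg.norm(toy_weights, axis=1)\n",
    "print(toy_weights.round(3))\n",
    "print(toy_row_norms)"
   ]
  },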
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normalize the features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "mms_tfidf=MinMaxScaler()\n",
    "mms_log=MinMaxScaler()\n",
    "X_tfidf=mms_tfidf.fit_transform(X_tfidf)\n",
    "X_log=mms_log.fit_transform(X_log)  # use the dedicated scaler so mms_tfidf stays fitted on the tf-idf features\n",
    "X_tfidf = pd.DataFrame(columns=columns_org,data=X_tfidf)\n",
    "X_log = pd.DataFrame(columns=columns_org,data=X_log)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feat_1</th>\n",
       "      <th>feat_2</th>\n",
       "      <th>feat_3</th>\n",
       "      <th>feat_4</th>\n",
       "      <th>feat_5</th>\n",
       "      <th>feat_6</th>\n",
       "      <th>feat_7</th>\n",
       "      <th>feat_8</th>\n",
       "      <th>feat_9</th>\n",
       "      <th>feat_10</th>\n",
       "      <th>...</th>\n",
       "      <th>feat_84</th>\n",
       "      <th>feat_85</th>\n",
       "      <th>feat_86</th>\n",
       "      <th>feat_87</th>\n",
       "      <th>feat_88</th>\n",
       "      <th>feat_89</th>\n",
       "      <th>feat_90</th>\n",
       "      <th>feat_91</th>\n",
       "      <th>feat_92</th>\n",
       "      <th>feat_93</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.172195</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.162608</td>\n",
       "      <td>0.649561</td>\n",
       "      <td>0.289065</td>\n",
       "      <td>0.489076</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.201849</td>\n",
       "      <td>...</td>\n",
       "      <td>0.721831</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.165443</td>\n",
       "      <td>0.260365</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.172195</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.142178</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 93 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     feat_1  feat_2  feat_3    feat_4    feat_5    feat_6    feat_7    feat_8  \\\n",
       "0  0.167949     0.0     0.0  0.000000  0.000000  0.000000  0.000000  0.000000   \n",
       "1  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000  0.159571   \n",
       "2  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000  0.159571   \n",
       "3  0.167949     0.0     0.0  0.162608  0.649561  0.289065  0.489076  0.000000   \n",
       "4  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000  0.000000   \n",
       "\n",
       "   feat_9   feat_10   ...      feat_84   feat_85   feat_86   feat_87  feat_88  \\\n",
       "0     0.0  0.000000   ...     0.000000  0.172195  0.000000  0.000000      0.0   \n",
       "1     0.0  0.000000   ...     0.000000  0.000000  0.000000  0.000000      0.0   \n",
       "2     0.0  0.000000   ...     0.000000  0.000000  0.000000  0.000000      0.0   \n",
       "3     0.0  0.201849   ...     0.721831  0.000000  0.165443  0.260365      0.0   \n",
       "4     0.0  0.000000   ...     0.000000  0.172195  0.000000  0.000000      0.0   \n",
       "\n",
       "   feat_89   feat_90  feat_91  feat_92  feat_93  \n",
       "0      0.0  0.000000      0.0      0.0      0.0  \n",
       "1      0.0  0.000000      0.0      0.0      0.0  \n",
       "2      0.0  0.000000      0.0      0.0      0.0  \n",
       "3      0.0  0.000000      0.0      0.0      0.0  \n",
       "4      0.0  0.142178      0.0      0.0      0.0  \n",
       "\n",
       "[5 rows x 93 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_log.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the transformed features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>feat_1</th>\n",
       "      <th>feat_2</th>\n",
       "      <th>feat_3</th>\n",
       "      <th>feat_4</th>\n",
       "      <th>feat_5</th>\n",
       "      <th>feat_6</th>\n",
       "      <th>feat_7</th>\n",
       "      <th>feat_8</th>\n",
       "      <th>feat_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feat_85</th>\n",
       "      <th>feat_86</th>\n",
       "      <th>feat_87</th>\n",
       "      <th>feat_88</th>\n",
       "      <th>feat_89</th>\n",
       "      <th>feat_90</th>\n",
       "      <th>feat_91</th>\n",
       "      <th>feat_92</th>\n",
       "      <th>feat_93</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.075886</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.162608</td>\n",
       "      <td>0.649561</td>\n",
       "      <td>0.289065</td>\n",
       "      <td>0.489076</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.008244</td>\n",
       "      <td>0.022456</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.124622</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.145988</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 188 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id    feat_1  feat_2  feat_3    feat_4    feat_5    feat_6    feat_7  \\\n",
       "0   1  0.167949     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "1   2  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "2   3  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "3   4  0.167949     0.0     0.0  0.162608  0.649561  0.289065  0.489076   \n",
       "4   5  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "\n",
       "     feat_8  feat_9   ...      feat_85   feat_86   feat_87  feat_88  feat_89  \\\n",
       "0  0.000000     0.0   ...     0.075886  0.000000  0.000000      0.0      0.0   \n",
       "1  0.159571     0.0   ...     0.000000  0.000000  0.000000      0.0      0.0   \n",
       "2  0.159571     0.0   ...     0.000000  0.000000  0.000000      0.0      0.0   \n",
       "3  0.000000     0.0   ...     0.000000  0.008244  0.022456      0.0      0.0   \n",
       "4  0.000000     0.0   ...     0.124622  0.000000  0.000000      0.0      0.0   \n",
       "\n",
       "    feat_90  feat_91  feat_92  feat_93   target  \n",
       "0  0.000000      0.0      0.0      0.0  Class_1  \n",
       "1  0.000000      0.0      0.0      0.0  Class_1  \n",
       "2  0.000000      0.0      0.0      0.0  Class_1  \n",
       "3  0.000000      0.0      0.0      0.0  Class_1  \n",
       "4  0.145988      0.0      0.0      0.0  Class_1  \n",
       "\n",
       "[5 rows x 188 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pickle as pc\n",
    "X_log_tfidf2 = pd.concat([train_data[\"id\"],X_log,X_tfidf,y],axis=1)\n",
    "X_log_tfidf2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_log_tfidf2.to_csv(\"Otto_train_log_tfidf2.csv\",index=False,header=True)\n",
    "# save the fitted transformers so the same preprocessing can be applied to the test data\n",
    "for fname, obj in [(\"tfidf.pkl\", tfidf), (\"mms_log.pkl\", mms_log), (\"mms_tfidf.pkl\", mms_tfidf)]:\n",
    "    with open(fname, \"wb\") as f:\n",
    "        pc.dump(obj, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the data and train with LightGBM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>feat_1</th>\n",
       "      <th>feat_2</th>\n",
       "      <th>feat_3</th>\n",
       "      <th>feat_4</th>\n",
       "      <th>feat_5</th>\n",
       "      <th>feat_6</th>\n",
       "      <th>feat_7</th>\n",
       "      <th>feat_8</th>\n",
       "      <th>feat_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feat_85.1</th>\n",
       "      <th>feat_86.1</th>\n",
       "      <th>feat_87.1</th>\n",
       "      <th>feat_88.1</th>\n",
       "      <th>feat_89.1</th>\n",
       "      <th>feat_90.1</th>\n",
       "      <th>feat_91.1</th>\n",
       "      <th>feat_92.1</th>\n",
       "      <th>feat_93.1</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.075886</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.162608</td>\n",
       "      <td>0.649561</td>\n",
       "      <td>0.289065</td>\n",
       "      <td>0.489076</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.008244</td>\n",
       "      <td>0.022456</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.124622</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.145988</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 188 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id    feat_1  feat_2  feat_3    feat_4    feat_5    feat_6    feat_7  \\\n",
       "0   1  0.167949     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "1   2  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "2   3  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "3   4  0.167949     0.0     0.0  0.162608  0.649561  0.289065  0.489076   \n",
       "4   5  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "\n",
       "     feat_8  feat_9   ...     feat_85.1  feat_86.1  feat_87.1  feat_88.1  \\\n",
       "0  0.000000     0.0   ...      0.075886   0.000000   0.000000        0.0   \n",
       "1  0.159571     0.0   ...      0.000000   0.000000   0.000000        0.0   \n",
       "2  0.159571     0.0   ...      0.000000   0.000000   0.000000        0.0   \n",
       "3  0.000000     0.0   ...      0.000000   0.008244   0.022456        0.0   \n",
       "4  0.000000     0.0   ...      0.124622   0.000000   0.000000        0.0   \n",
       "\n",
       "   feat_89.1  feat_90.1  feat_91.1  feat_92.1  feat_93.1   target  \n",
       "0        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "1        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "2        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "3        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "4        0.0   0.145988        0.0        0.0        0.0  Class_1  \n",
       "\n",
       "[5 rows x 188 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"Otto_train_log_tfidf2.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(61878, 188)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0,0.5,'number')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAELCAYAAAARNxsIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAG2tJREFUeJzt3XuUXnV97/H3h0RQqpBgBo25OFEHakCkMGIUEQEbEk5L6Cq0QWtyMDWrGMC7wqFHWGBaqHpQWoQTTSRYFyHiJfEYDSkXUUuAAUIuBJoxIBmCZDAhUCnQwPf8sX8jO8MzmSfD75n9DPm81nrW7P3dv71/3z2ZzHd++6qIwMzMLIe9qk7AzMxeOVxUzMwsGxcVMzPLxkXFzMyycVExM7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2yGV53AYBs1alS0trZWnYaZ2ZBy1113PR4RLf212+OKSmtrKx0dHVWnYWY2pEj6TT3tfPjLzMyycVExM7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2xcVMzMLJuGFRVJCyRtkbS2V/xsSQ9IWifpn0rx8yR1pmUnluJTUqxT0rml+ARJt0vaIOk6SXs3al/MzKw+jRypXA1MKQckHQdMAw6LiEOAr6T4RGA6cEha5xuShkkaBlwBTAUmAqentgCXApdFRBuwDZjVwH0xM7M6NOyO+oi4VVJrr/CZwCUR8WxqsyXFpwGLUvxBSZ3AUWlZZ0RsBJC0CJgmaT1wPPCh1GYhcCFwZWP2ZnA9fNE7Br3P8V9cM+h9mtkrz2CfUzkIOCYdtvq5pHel+BhgU6ldV4r1FX898ERE7OgVr0nSbEkdkjq6u7sz7YqZmfU22EVlODASmAR8DlgsSYBqtI0BxGuKiHkR0R4R7S0t/T4PzczMBmiwHyjZBfwgIgK4Q9ILwKgUH1dqNxbYnKZrxR8HRkgankYr5fZmZlaRwR6p/IjiXAiSDgL2pigQS4HpkvaRNAFoA+4A7gTa0pVee1OczF+aitLNwKlpuzOBJYO6J2Zm9hING6lIuhb4ADBKUhdwAbAAWJAuM34OmJkKxDpJi4H7gB3AnIh4Pm3nLGA5MAxYEBHrUhdfABZJ+hJwDzC/UftiZmb1aeTVX6f3sehv+mg/F5hbI74MWFYjvpEXrxAzM7Mm4DvqzcwsGxcVMzPLxkXFzMyycVExM7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2xcVMzMLBsXFTMzy8ZFxczMsnFRMTOzbFxUzMwsGxcVMzPLxkXFzMyycVExM7NsGlZUJC2QtCW95bH3ss9KCkmj0rwkXS6pU9JqSUeU2s6UtCF9ZpbiR0pak9a5XJIatS9mZlafRo5Urgam9A5KGgf8KfBwKTyV4r30bcBs4MrU9gCK1xC/m+ItjxdIGpnWuTK17VnvJX2ZmdngauTrhG+V1Fpj0WXA54Elpdg04Jr0vvqVkkZIGk3xjvsVEbEVQNIKYIqkW4D9IuK2FL8GOAX4aWP2xmzomvs3p1bS7/n/en0l/Vq1BvWciqSTgUci4t5ei8YAm0rzXSm2q3hXjbiZmVWoYSOV3iTtC5wPTK61uEYsBhDvq+/ZFIfKGD9+fL+5mpnZwAzmSOWtwATgXkkPAWOBuyW9kWKkMa7UdiywuZ/42BrxmiJiXkS0R0R7S0tLhl0xM7NaBq2oRMSaiDgwIlojopWiMBwREb8FlgIz0lVgk4DtEfEosByYLGlkOkE/GVielj0laVK66msGO5+jMTOzCjTykuJrgduAgyV1SZq1i+bLgI1AJ/BN4OMA6QT9xcCd6XNRz0l74EzgW2mdX+OT9GZmlWvk1V+n97O8tTQdwJw+2i0AFtSIdwCHvrwszcwsJ99Rb2Zm2biomJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYu
KmZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWjYuKmZll46JiZmbZNPJ1wgskbZG0thT7sqT7Ja2W9ENJI0rLzpPUKekBSSeW4lNSrFPSuaX4BEm3S9og6TpJezdqX8zMrD6NHKlcDUzpFVsBHBoRhwH/AZwHIGkiMB04JK3zDUnDJA0DrgCmAhOB01NbgEuByyKiDdgGzGrgvpiZWR0aVlQi4lZga6/YDRGxI82uBMam6WnAooh4NiIeBDqBo9KnMyI2RsRzwCJgmiQBxwPXp/UXAqc0al/MzKw+VZ5T+Sjw0zQ9BthUWtaVYn3FXw88USpQPXEzM6tQJUVF0vnADuC7PaEazWIA8b76my2pQ1JHd3f37qZrZmZ1GvSiImkm8GfAhyOipxB0AeNKzcYCm3cRfxwYIWl4r3hNETEvItojor2lpSXPjpiZ2UsMalGRNAX4AnByRDxdWrQUmC5pH0kTgDbgDuBOoC1d6bU3xcn8pakY3QycmtafCSwZrP0wM7PaGnlJ8bXAbcDBkrokzQL+BXgdsELSKklXAUTEOmAxcB/wM2BORDyfzpmcBSwH1gOLU1soitOnJXVSnGOZ36h9MTOz+gzvv8nARMTpNcJ9/uKPiLnA3BrxZcCyGvGNFFeHmZlZk/Ad9WZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYuKmZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWTSNfJ7xA0hZJa0uxAyStkLQhfR2Z4pJ0uaROSaslHVFaZ2Zqv0HSzFL8SElr0jqXS1Kj9sXMzOrTyJHK1cCUXrFzgRsjog24Mc0DTAXa0mc2cCUURQi4AHg3xauDL+gpRKnN7NJ6vfsyM7NB1rCiEhG3Alt7hacBC9P0QuCUUvyaKKwERkgaDZwIrIiIrRGxDVgBTEnL9ouI2yIigGtK2zIzs4oM9jmVN0TEowDp64EpPgbYVGrXlWK7infViNckabakDkkd3d3dL3snzMystmY5UV/rfEgMIF5TRMyLiPaIaG9paRlgimZm1p/BLiqPpUNXpK9bUrwLGFdqNxbY3E98bI24mZlVqN+iImmYpH/L1N9SoOcKrpnAklJ8RroKbBKwPR0eWw5MljQynaCfDCxPy56SNCld9TWjtC0zM6vI8P4aRMTzkp6WtH9EbK93w5KuBT4AjJLURXEV1yXAYkmzgIeB01LzZcBJQCfwNHBG6nurpIuBO1O7iyKi5+T/mRRXmL0G+Gn6mJlZhfotKskzwBpJK4Df9wQj4py+VoiI0/tYdEKNtgHM6WM7C4AFNeIdwKG7TtvMzAZTvUXlJ+ljZmbWp7qKSkQslPQaYHxEPNDgnMzMbIiq6+ovSX8OrAJ+luYPl7S0kYmZmdnQU+8lxRdSPCblCYCIWAVMaFBOZmY2RNVbVHbUuPKrz5sNzcxsz1Tvifq1kj4EDJPUBpwD/Hvj0jIzs6Go3pHK2cAhwLPAtcCTwCcblZSZmQ1N9V799TRwvqRLi9l4qrFpmZnZUFTv1V/vkrQGWE1xE+S9ko5sbGpmZjbU1HtOZT7w8Yj4BYCk9wHfBg5rVGJmZjb01HtO5ameggIQEb8EfAjMzMx2ssuRSuld8XdI+r8UJ+kD+GvglsamZmZmQ01/h7++2mv+gtK071MxM7Od7LKoRMRxg5WImZkNfXWdqJc0guJFWK3ldXb16Hszs6Hmwgsv3KP6bYR6r/5aBqwE1gAvNC4dMzMbyuotKq+OiE/n6lTSp4C/pTgvs4biTY+jgUXAAcDdwEci4jlJ+wDXAEcCvwP+OiIeSts5D5gFPA+cExHLc+VoZma7r95Lir8j6WOSRks6oOczkA4ljaF4dlh7RBwKDAOmA5cCl0VEG7CNoliQvm6LiLcBl6V2SJqY1jsEmAJ8Q9KwgeRkZmZ51FtUngO+DNwG3JU+HS+j3+HAayQN
B/YFHgWOB65PyxcCp6TpaWmetPwESUrxRRHxbEQ8SPF++6NeRk5mZvYy1Xv469PA2yLi8ZfbYUQ8IukrwMPAfwE3UBSpJyJiR2rWBYxJ02OATWndHZK2A69P8ZWlTZfXMTOzCtRbVNYBT+foUNJIilHGBIqXfn0PmFqjac99MOpjWV/xWn3OBmYDjB8/fjczNoCj//noSvr91dm/qqRfMxuYeovK88AqSTdTPP4eGPAlxR8EHoyIbgBJPwDeC4yQNDyNVsYCm1P7LmAc0JUOl+0PbC3Fe5TX2UlEzAPmAbS3t/umTTOzBqn3nMqPgLkUL+a6q/QZiIeBSZL2TedGTgDuA24GTk1tZgJL0vTSNE9aflNERIpPl7SPpAlAG3DHAHMyM7MM6n2fysL+W9UnIm6XdD3FZcM7gHsoRhE/ARZJ+lKKzU+rzKe4+qyTYoQyPW1nnaTFFAVpBzAnIp7PlaeZme2+eu+of5Aa5ysi4i0D6TQiLmDn54gBbKTG1VsR8QxwWh/bmUsxgjIzsyZQ7zmV9tL0qyl+yQ/oPhUzM3vlquucSkT8rvR5JCK+RnFfiZmZ2R/Ue/jriNLsXhQjl9c1JCMzMxuy6j389VVePKeyA3iIPs5zmJnZnqveojIV+Et2fvT9dOCiBuRkZmZDVL1F5UcUd7/fDTzTuHTMzGwoq7eojI2IKQ3NxMzMhrx676j/d0nvaGgmZmY25NU7Unkf8D/TTZDPUjzMMSLisIZlZmZmQ87unKg3MzPbpXqf/fWbRidiZmZDX73nVMzMzPrlomJmZtm4qJiZWTYuKmZmlo2LipmZZeOiYmZm2VRSVCSNkHS9pPslrZf0HkkHSFohaUP6OjK1laTLJXVKWl1+DL+kman9Bkkz++7RzMwGQ1Ujla8DP4uIPwbeCawHzgVujIg24MY0D8WNl23pMxu4EkDSARSvJH43xWuIL+gpRGZmVo1BLyqS9gPeD8wHiIjnIuIJYBqwMDVbCJySpqcB10RhJTBC0mjgRGBFRGyNiG3ACsAPvTQzq1AVI5W3AN3AtyXdI+lbkv4IeENEPAqQvh6Y2o8BNpXW70qxvuJmZlaRKorKcOAI4MqI+BPg97x4qKsW1YjFLuIv3YA0W1KHpI7u7u7dzdfMzOpURVHpAroi4vY0fz1FkXksHdYifd1Saj+utP5YYPMu4i8REfMioj0i2ltaWrLtiJmZ7WzQi0pE/BbYJOngFDoBuA9YCvRcwTUTWJKmlwIz0lVgk4Dt6fDYcmCypJHpBP3kFDMzs4rU++j73M4Gvitpb2AjcAZFgVssaRbwMHBaarsMOAnoBJ5ObYmIrZIuBu5M7S6KiK2DtwtmZtZbJUUlIlYB7TUWnVCjbQBz+tjOAmBB3uzMzGygfEe9mZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYuKmZmlk1Vd9SbmVkdFn/vqEr6/avT7hjQeh6pmJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtlUVlQkDZN0j6T/l+YnSLpd0gZJ16VXDSNpnzTfmZa3lrZxXoo/IOnEavbEzMx6VDlS+QSwvjR/KXBZRLQB24BZKT4L2BYRbwMuS+2QNBGYDhwCTAG+IWnYIOVuZmY1VFJUJI0F/gfwrTQv4Hjg+tRkIXBKmp6W5knLT0jtpwGLIuLZiHgQ6ASqeZ6BmZkB1Y1UvgZ8Hnghzb8eeCIidqT5LmBMmh4DbAJIy7en9n+I11jHzMwqMOhFRdKfAVsi4q5yuEbT6GfZrtbp3edsSR2SOrq7u3crXzMzq18VI5WjgZMlPQQsojjs9TVghKSepyaPBTan6S5gHEBavj+wtRyvsc5OImJeRLRHRHtLS0vevTEzsz8Y9KISEedFxNiIaKU40X5TRHwYuBk4NTWbCSxJ00vTPGn5TRERKT49XR02AWgDBvasZjMzy6KZ3qfyBWCRpC8B9wDz
U3w+8B1JnRQjlOkAEbFO0mLgPmAHMCcinh/8tM3MrEelRSUibgFuSdMbqXH1VkQ8A5zWx/pzgbmNy9DMzHaH76g3M7NsXFTMzCwbFxUzM8vGRcXMzLJxUTEzs2xcVMzMLBsXFTMzy8ZFxczMsnFRMTOzbJrpMS1mtgdZP/emQe/z7ecfP+h97mk8UjEzs2w8UrEh6+fvP7aSfo+99ed9LvuXz/x4EDN50Vlf/fNK+jXrzSMVMzPLxkXFzMyycVExM7NsXFTMzCybQS8qksZJulnSeknrJH0ixQ+QtELShvR1ZIpL0uWSOiWtlnREaVszU/sNkmb21aeZmQ2OKkYqO4DPRMTbgUnAHEkTgXOBGyOiDbgxzQNMpXj/fBswG7gSiiIEXAC8m+KNkRf0FCIzM6vGoBeViHg0Iu5O008B64ExwDRgYWq2EDglTU8DronCSmCEpNHAicCKiNgaEduAFcCUQdwVMzPrpdJzKpJagT8BbgfeEBGPQlF4gANTszHAptJqXSnWV9zMzCpSWVGR9Frg+8AnI+LJXTWtEYtdxGv1NVtSh6SO7u7u3U/WzMzqUskd9ZJeRVFQvhsRP0jhxySNjohH0+GtLSneBYwrrT4W2JziH+gVv6VWfxExD5gH0N7e/ofCc+TnrnnZ+zIQd315RiX9mpk1WhVXfwmYD6yPiP9TWrQU6LmCayawpBSfka4CmwRsT4fHlgOTJY1MJ+gnp5iZmVWkipHK0cBHgDWSVqXY/wIuARZLmgU8DJyWli0DTgI6gaeBMwAiYquki4E7U7uLImLr4OyCmZnVMuhFJSJ+Se3zIQAn1GgfwJw+trUAWJAvOzMzezl8R72ZmWXjomJmZtm4qJiZWTYuKmZmlo2LipmZZeOiYmZm2biomJlZNi4qZmaWjYuKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtm4qJiZWTYuKmZmlo2LipmZZTPki4qkKZIekNQp6dyq8zEz25MN6aIiaRhwBTAVmAicLmlitVmZme25hnRRAY4COiNiY0Q8BywCplWck5nZHmuoF5UxwKbSfFeKmZlZBRQRVecwYJJOA06MiL9N8x8BjoqIs3u1mw3MTrMHAw9k6H4U8HiG7eTWjHk5p/o4p/o1Y16v9JzeHBEt/TUanqmzqnQB40rzY4HNvRtFxDxgXs6OJXVERHvObebQjHk5p/o4p/o1Y17OqTDUD3/dCbRJmiBpb2A6sLTinMzM9lhDeqQSETsknQUsB4YBCyJiXcVpmZntsYZ0UQGIiGXAsgq6zno4LaNmzMs51cc51a8Z83JODPET9WZm1lyG+jkVMzNrIi4qZmaWzR5ZVCS9UdIiSb+WdJ+kZZIOkrS2wf2eJmmdpBcktfdaVlVOX5Z0v6TVkn4oaUQT5HRxymeVpBskvanX8kryKvX/WUkhaVTVOUm6UNIj6Xu1StJJVeeU+j47PZNvnaR/qjonSdeVvkcPSVrVBDkdLmllyqlD0lG9lleV1zsl3SZpjaQfS9pvtzYQEXvUBxBwG/B3pdjhwDHA2gb3/XaKmy9vAdqbJKfJwPA0fSlwaRPktF9p+hzgqmb4XqW+xlFcbfgbYFTVOQEXAp+tEa8yp+OAfwP2SfMHVp1Tr/y+Cnyx6pyAG4Cpafok4JYm+fe7Ezg2TX8UuHh31t8TRyrHAf8dEVf1BCJiFaXHvUhqlfQLSXenz3tTfLSkW9NfFmslHSNpmKSr0/waSZ/qq+OIWB8Rte7mrzKnGyJiR5pdSXEDadU5PVma/SOgfDVJZXkllwGfb7KcaqkypzOBSyLi2dTvlibIqWf7Av4KuLYJcgqgZxSwPzvfuF1lXgcDt6bpFcBf7qLtSwz5S4oH4FDgrn7abAH+NCKekdRG8QPYDnwIWB4Rc1U8IXlfir8exkTEoQAqHT4agjl9FLiuGXKSNBeYAWyn+A/Wo7K8JJ0MPBIR9xa/m6rPKTlL0gygA/hMRGyrOKeDgGPS
v+EzFCOpOyvOqccxwGMRsSHNV5nTJ4Hlkr5CcSrivaVlVea1FjgZWAKcxs5PLenXnjhSqcergG9KWgN8j+Kx+lAMC8+QdCHwjoh4CtgIvEXSP0uaAjxZa4PNnpOk84EdwHebIaeIOD8ixqV8ztqNnBqSl6R9gfOBL+5mLg3LKbkSeCvFL41HKQ7tVJ3TcGAkMAn4HLBYvapwBTn1OJ0XRyn1alROZwKfSj/nnwLmN0leHwXmSLoLeB3w3G5l1chjc834AU4Abq0RbyUdq6Q4Tt3z18NwYEep3ZuAjwFrgBkp9lqKIeKPKe7q7y+HW9j5nEqlOQEzKY7f7tssOZW282ZKx5Crygt4B8Vfhg+lzw7gYeCNTfS9KvdXWU7Az4APlOZ/DbRU/X1K23sMGFv1z1Nqt50X7xUU8GQz5NWrv4OAO+pp2/PZE0cqNwH7SPpYT0DSuyh+efXYH3g0Il4APkLxCBgkvRnYEhHfpPir4ggVVwDtFRHfB/43cMRQyin91fIF4OSIeLpJcmorzZ4M3F91XhGxJiIOjIjWiGileJjpERHx24q/V6NLs39Bceiisu9T8iPg+LStg4C9KZ6UW/X/vQ8C90dEVylWZU6bgWPT9PHAhtKyKn+mDkxf9wL+Hriqr7Y17U4FeqV8KKr4Yoq/oNYBPwHaePEvgDZgNcWJ638E/jPFZ1L8p70H+AUwAXgncDewKn2m7qLfv6D4ZfQsxV9My5sgp06Kk389ba9qgpy+n9ZfTfFX1Zhm+PfrlcNDpKu/Kv5efYfir9HVFA9THd0EOe0N/Gvaxt3A8VXnlLZxNaWrqarOCXgfxXmTe4HbgSObJK9PAP+RPpeQRlP1fvyYFjMzy2ZPPPxlZmYNsideUtxwkq4Aju4V/npEfLuKfMA57Y5mzMs51cc51a9Refnwl5mZZePDX2Zmlo2LipmZZeOiYpaRpBGSPl51HmZVcVExy2sEUHdRUeFl/T+U5AturGn4h9Esr0uAt6p4X8fNwGEUz8F6FfD3EbFEUivw07T8PcApkj5I8WSDzRR3Vj8bEWdJaqG4o3l82v4nI+JX6blOb6J4bMfjFA8RNKuci4pZXucCh0bE4WkEsW9EPJkek7FS0tLU7mDgjIj4uIqXkPU8OuMpikd03JvafR24LCJ+KWk8xbtc3p6WHQm8LyL+a3B2zax/LipmjSPgHyS9H3gBGAO8IS37TUSsTNNHAT+PiK0Akr5H8SA/KJ5XNbH0kN/9JL0uTS91QbFm46Ji1jgfpng675ER8d+SHgJenZb9vtRuV4+F3wt4T+/ikYrM72uuYVYhn6g3y+spindQQPEU2S2poBzHzk+YLbsDOFbSyHTIrPymvRsovU9G0uENyNksG49UzDKKiN9J+pWktRQvS/pjSR0UT4a9v491HpH0DxRPqt0M3Efxrg2Ac4ArJK2m+P96K/B3Dd4NswHzY1rMmoCk10bEf6aRyg8pXqL0w6rzMttdPvxl1hwuTJchrwUepHjRldmQ45GKmZll45GKmZll46JiZmbZuKiYmVk2LipmZpaNi4qZmWXjomJmZtn8f6hs7Hva5FFpAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.countplot(x=df['target'])\n",
    "plt.xlabel(\"target\")\n",
    "plt.ylabel(\"number\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The class distribution is still roughly balanced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "X=df.drop([\"id\",\"target\"],axis=1)\n",
    "y=df[\"target\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feeding the features to the model as a sparse matrix speeds up training."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LightGBM accepts categorical features directly, with no need for one-hot encoding, but the categories must first be converted to integer codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.sparse import csc_matrix\n",
    "X=csc_matrix(X)\n",
    "# Encode the labels Class_1..Class_9 as integers 0-8\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "le = LabelEncoder()\n",
    "y = le.fit_transform(y)"
   ]
  },
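  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The label encoding above can be pictured with a small dependency-free sketch. It assumes the behaviour of `LabelEncoder` (distinct class names are sorted, then numbered from 0); `encode_labels` and the sample labels are hypothetical names used only for illustration.\n",
    "\n",
    "```python\n",
    "def encode_labels(labels):\n",
    "    # Sort the distinct class names, then number them from 0.\n",
    "    classes = sorted(set(labels))\n",
    "    index = {name: code for code, name in enumerate(classes)}\n",
    "    return [index[name] for name in labels], classes\n",
    "\n",
    "\n",
    "# Hypothetical sample of target values.\n",
    "sample = ['Class_2', 'Class_1', 'Class_9', 'Class_2']\n",
    "codes, classes = encode_labels(sample)\n",
    "print(codes)    # [1, 0, 2, 1]\n",
    "print(classes)  # ['Class_1', 'Class_2', 'Class_9']\n",
    "```\n",
    "\n",
    "With all nine Otto classes present the codes run from 0 to 8, which matches the label range that a `multiclass` objective with `num_class` set to 9 expects."
   ]
  },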
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tuning LightGBM hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lightgbm.sklearn import LGBMClassifier\n",
    "import lightgbm as lgbm\n",
    "from sklearn.model_selection import GridSearchCV"
   ]
  },
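  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main LightGBM hyperparameters are:\n",
    "1. The number of trees, n_estimators, and the learning rate, learning_rate\n",
    "2. The maximum tree depth, max_depth, and the maximum number of leaves per tree, num_leaves (LightGBM grows trees leaf-wise, so num_leaves matters most; set it below 2^max_depth)\n",
    "3. The minimum number of samples in a leaf: min_data_in_leaf (aliases min_data, min_child_samples)\n",
    "4. The fraction of columns sampled for each tree: feature_fraction / colsample_bytree\n",
    "5. The fraction of rows sampled for each tree: bagging_fraction / subsample (takes effect only when bagging_freq is also set to a positive value, e.g. bagging_freq=1)\n",
    "6. The regularization parameters lambda_l1 (reg_alpha) and lambda_l2 (reg_lambda)\n",
    "7. Two parameters that do not control model complexity but do affect speed and accuracy; adjust them to the range of the feature values and the number of samples:\n",
    "   1) max_bin, the maximum number of bins per feature: default 255;\n",
    "   2) subsample_for_bin, the number of samples used to build the histograms: default 200000.\n",
    "\n",
    "***\n",
    "Tune n_estimators with LightGBM's built-in cv function: like XGBoost, LightGBM ships a cross-validation routine that scores every boosting round in a single pass, which is very fast.\n",
    "The remaining parameters are tuned with GridSearchCV."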
   ]
  },
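  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the list above mixes native LightGBM names with their scikit-learn aliases, a small lookup table can help keep them straight. This is a sketch only: `SKLEARN_ALIASES` and `to_sklearn_names` are hypothetical helpers, and the table covers just the pairs discussed here, so it is worth checking against the parameter documentation of the installed LightGBM version.\n",
    "\n",
    "```python\n",
    "# Native LightGBM parameter name -> scikit-learn wrapper alias.\n",
    "SKLEARN_ALIASES = {\n",
    "    'num_iterations': 'n_estimators',\n",
    "    'min_data_in_leaf': 'min_child_samples',\n",
    "    'feature_fraction': 'colsample_bytree',\n",
    "    'bagging_fraction': 'subsample',\n",
    "    'bagging_freq': 'subsample_freq',\n",
    "    'lambda_l1': 'reg_alpha',\n",
    "    'lambda_l2': 'reg_lambda',\n",
    "}\n",
    "\n",
    "\n",
    "def to_sklearn_names(params):\n",
    "    # Rename known native keys; leave every other key untouched.\n",
    "    return {SKLEARN_ALIASES.get(key, key): value for key, value in params.items()}\n",
    "\n",
    "\n",
    "# Hypothetical parameter set written with native names.\n",
    "native = {'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 1, 'max_depth': 7}\n",
    "converted = to_sklearn_names(native)\n",
    "print(converted)\n",
    "```\n",
    "\n",
    "Keys without an entry in the table, such as max_depth, pass through unchanged."
   ]
  },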
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepare cross validation\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parameters to tune (initial values)\n",
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.7,\n",
    "          'num_class':9 # multiclass problem: number of target classes\n",
    "         }\n",
    "\n",
    "# Build the LightGBM Dataset from the features and labels\n",
    "lgbm_data=lgbm.Dataset(X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. n_estimators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling LightGBM's built-in cross-validation (cv) directly evaluates every consecutive value of n_estimators quickly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GridSearchCV, by contrast, can only cross-validate a finite set of candidate values and is comparatively slow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgbm_params=params.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'boosting_type': 'gbdt',\n",
       " 'objective': 'multiclass',\n",
       " 'n_jobs': 4,\n",
       " 'learning_rate': 0.1,\n",
       " 'num_leaves': 80,\n",
       " 'max_depth': 7,\n",
       " 'max_bin': 127,\n",
       " 'subsample': 0.7,\n",
       " 'bagging_freq': 1,\n",
       " 'colsample_bytree': 0.7,\n",
       " 'num_class': 9}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lgbm_params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[10]\tcv_agg's multi_logloss: 1.04227 + 0.00229602\n",
      "[20]\tcv_agg's multi_logloss: 0.780179 + 0.00195221\n",
      "[30]\tcv_agg's multi_logloss: 0.664167 + 0.00200303\n",
      "[40]\tcv_agg's multi_logloss: 0.606596 + 0.00139898\n",
      "[50]\tcv_agg's multi_logloss: 0.573894 + 0.00131689\n",
      "[60]\tcv_agg's multi_logloss: 0.553312 + 0.00136051\n",
      "[70]\tcv_agg's multi_logloss: 0.53861 + 0.0014552\n",
      "[80]\tcv_agg's multi_logloss: 0.527778 + 0.00144727\n",
      "[90]\tcv_agg's multi_logloss: 0.519655 + 0.00176412\n",
      "[100]\tcv_agg's multi_logloss: 0.512617 + 0.00174557\n",
      "[110]\tcv_agg's multi_logloss: 0.507166 + 0.00195905\n",
      "[120]\tcv_agg's multi_logloss: 0.502736 + 0.00237554\n",
      "[130]\tcv_agg's multi_logloss: 0.49883 + 0.00258164\n",
      "[140]\tcv_agg's multi_logloss: 0.495476 + 0.00251386\n",
      "[150]\tcv_agg's multi_logloss: 0.49264 + 0.00247921\n",
      "[160]\tcv_agg's multi_logloss: 0.490272 + 0.00251774\n",
      "[170]\tcv_agg's multi_logloss: 0.488096 + 0.00264998\n",
      "[180]\tcv_agg's multi_logloss: 0.48612 + 0.00292814\n",
      "[190]\tcv_agg's multi_logloss: 0.484422 + 0.00302191\n",
      "[200]\tcv_agg's multi_logloss: 0.482893 + 0.00301716\n",
      "[210]\tcv_agg's multi_logloss: 0.481491 + 0.0030625\n",
      "[220]\tcv_agg's multi_logloss: 0.480504 + 0.00297745\n",
      "[230]\tcv_agg's multi_logloss: 0.479457 + 0.00302001\n",
      "[240]\tcv_agg's multi_logloss: 0.478624 + 0.0029872\n",
      "[250]\tcv_agg's multi_logloss: 0.477894 + 0.00301128\n",
      "[260]\tcv_agg's multi_logloss: 0.477267 + 0.00310885\n",
      "[270]\tcv_agg's multi_logloss: 0.476676 + 0.00316073\n",
      "[280]\tcv_agg's multi_logloss: 0.476341 + 0.00309127\n",
      "[290]\tcv_agg's multi_logloss: 0.475936 + 0.00303841\n",
      "[300]\tcv_agg's multi_logloss: 0.475855 + 0.00287127\n",
      "[310]\tcv_agg's multi_logloss: 0.475649 + 0.00299611\n",
      "[320]\tcv_agg's multi_logloss: 0.475649 + 0.00295994\n",
      "[330]\tcv_agg's multi_logloss: 0.475872 + 0.00284167\n"
     ]
    }
   ],
   "source": [
    "cv_result=lgbm.cv(lgbm_params,lgbm_data,nfold=3,metrics='multi_logloss',num_boost_round=10000,early_stopping_rounds=20,seed=3,verbose_eval=10)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4756183824327594\n",
      "313\n"
     ]
    }
   ],
   "source": [
    "print(min(cv_result['multi_logloss-mean']))\n",
    "print(len(cv_result['multi_logloss-mean']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the optimal n_estimators found, fix the stratified folds so that every later tuning step is trained and scored on the same splits."
   ]
  },
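  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading the optimal number of rounds off a cross-validation curve can be sketched as below. In the run above the list returned by cv appears to be cut off at the best round, so its length already gives n_estimators; the sketch covers the more general case of an untruncated curve. `best_round` and the loss values are hypothetical and are not taken from the run above.\n",
    "\n",
    "```python\n",
    "def best_round(mean_losses):\n",
    "    # Rounds are 1-based, so add 1 to the index of the smallest loss.\n",
    "    best_index = min(range(len(mean_losses)), key=mean_losses.__getitem__)\n",
    "    return best_index + 1, mean_losses[best_index]\n",
    "\n",
    "\n",
    "# Hypothetical per-round mean validation losses.\n",
    "losses = [1.04, 0.78, 0.66, 0.61, 0.60, 0.62]\n",
    "rounds, loss = best_round(losses)\n",
    "print(rounds, loss)  # 5 0.6\n",
    "```\n",
    "\n",
    "When several rounds tie for the smallest loss, this sketch returns the earliest one, which also gives the smallest model."
   ]
  },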
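  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. num_leaves & max_depth=7\n",
    "Rough relationship: a tree of depth max_depth has at most 2^(max_depth) leaves. num_leaves should be set below 2^(max_depth), otherwise the model may overfit.\n",
    "Suggested num_leaves: 70-80; the larger the value, the more complex the model and the more prone it is to overfitting.\n",
    "max_depth is raised to 7 accordingly."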
   ]
  },
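  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The leaf budget implied by max_depth is easy to check by hand. The sketch below uses a hypothetical helper, `max_leaves`, to confirm that the candidate values 70, 75 and 80 all stay under the limit for max_depth=7.\n",
    "\n",
    "```python\n",
    "def max_leaves(max_depth):\n",
    "    # A binary tree of depth d has at most 2**d leaves.\n",
    "    return 2 ** max_depth\n",
    "\n",
    "\n",
    "max_depth = 7\n",
    "candidates = [70, 75, 80]\n",
    "limit = max_leaves(max_depth)\n",
    "within_limit = all(n < limit for n in candidates)\n",
    "print(limit)         # 128\n",
    "print(within_limit)  # True\n",
    "```\n",
    "\n",
    "With max_depth=6 the limit would drop to 64 and none of these candidates would fit, which is why the depth is raised alongside num_leaves."
   ]
  },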
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 3 candidates, totalling 9 fits\n",
      "[CV] num_leaves=70 ...................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...................... num_leaves=70, score=-0.477, total=  35.1s\n",
      "[CV] num_leaves=70 ...................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   35.0s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...................... num_leaves=70, score=-0.473, total=  35.6s\n",
      "[CV] num_leaves=70 ...................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.2min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...................... num_leaves=70, score=-0.480, total=  35.6s\n",
      "[CV] num_leaves=75 ...................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.8min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...................... num_leaves=75, score=-0.475, total=  42.9s\n",
      "[CV] num_leaves=75 ...................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.5min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...................... num_leaves=75, score=-0.472, total=  37.4s\n",
      "[CV] num_leaves=75 ...................................................\n",
      "[CV] ...................... num_leaves=75, score=-0.479, total=  37.8s\n",
      "[CV] num_leaves=80 ...................................................\n",
      "[CV] ...................... num_leaves=80, score=-0.475, total=  37.9s\n",
      "[CV] num_leaves=80 ...................................................\n",
      "[CV] ...................... num_leaves=80, score=-0.472, total=  34.5s\n",
      "[CV] num_leaves=80 ...................................................\n",
      "[CV] ...................... num_leaves=80, score=-0.479, total=  36.1s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  5.5min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.7,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=339,\n",
       "                                      n_jobs=4, num_leaves=31,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None, param_grid={'num_leaves': [70, 75, 80]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "#           'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':313, # number of boosting rounds found in step 1\n",
    "          'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.7,\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False,**params) # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(num_leaves=[70,75,80])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=kfold,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'num_leaves': 80} -0.4754431198583293\n",
      "[-0.47653161 -0.47564784 -0.47544312]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best num_leaves is 80, the largest value in the searched grid."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. min_child_samples\n",
    "The minimum number of samples required in a leaf. A larger value reduces overfitting (but may cause underfitting).\n",
    "Search range: 10-50\n",
    "\n",
    "min_child_weight: the minimum sum of Hessians in a leaf. Used together with min_child_samples; larger values reduce overfitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 4 candidates, totalling 12 fits\n",
      "[CV] min_child_samples=10 ............................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ............... min_child_samples=10, score=-0.478, total=  36.2s\n",
      "[CV] min_child_samples=10 ............................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   36.1s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ............... min_child_samples=10, score=-0.475, total=  36.5s\n",
      "[CV] min_child_samples=10 ............................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.2min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ............... min_child_samples=10, score=-0.481, total=  37.2s\n",
      "[CV] min_child_samples=20 ............................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.8min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ............... min_child_samples=20, score=-0.475, total=  33.8s\n",
      "[CV] min_child_samples=20 ............................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.4min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ............... min_child_samples=20, score=-0.472, total=  34.0s\n",
      "[CV] min_child_samples=20 ............................................\n",
      "[CV] ............... min_child_samples=20, score=-0.479, total=  38.9s\n",
      "[CV] min_child_samples=30 ............................................\n",
      "[CV] ............... min_child_samples=30, score=-0.474, total=  33.0s\n",
      "[CV] min_child_samples=30 ............................................\n",
      "[CV] ............... min_child_samples=30, score=-0.472, total=  29.0s\n",
      "[CV] min_child_samples=30 ............................................\n",
      "[CV] ............... min_child_samples=30, score=-0.479, total=  29.9s\n",
      "[CV] min_child_samples=40 ............................................\n",
      "[CV] ............... min_child_samples=40, score=-0.471, total=  30.2s\n",
      "[CV] min_child_samples=40 ............................................\n",
      "[CV] ............... min_child_samples=40, score=-0.473, total=  28.2s\n",
      "[CV] min_child_samples=40 ............................................\n",
      "[CV] ............... min_child_samples=40, score=-0.480, total=  28.0s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  6.6min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.7,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=313,\n",
       "                                      n_jobs=4, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'min_child_samples': range(10, 50, 10)},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,#选用已经调好的叶子数目\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':313,#选用已经调好的迭代次数\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.7,\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False,**params)#**表示以字典的形式导入\n",
    "tuned_params = dict(min_child_samples=range(10,50,10))\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=kfold,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'min_child_samples': 30} -0.47465215209018574\n",
      "[-0.47797835 -0.47527028 -0.47465215 -0.47478558]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "min_child_weight最优值30"
   ]
  },
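  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reading aid, the sketch below shows how the best candidate follows from the mean scores printed above. It is only an illustration: the names `candidates` and `mean_scores` are introduced here, and the numbers are copied from the output rather than recomputed. Because the scoring is `neg_log_loss`, a larger (less negative) score is better.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Candidate values from range(10, 50, 10) and the printed mean test scores\n",
    "candidates = [10, 20, 30, 40]\n",
    "mean_scores = np.array([-0.47797835, -0.47527028, -0.47465215, -0.47478558])\n",
    "\n",
    "# neg_log_loss: the largest value corresponds to the smallest log loss\n",
    "best_idx = int(np.argmax(mean_scores))\n",
    "best_value = candidates[best_idx]\n",
    "\n",
    "# Difference between the best and the second-best mean score\n",
    "ranked = np.sort(mean_scores)\n",
    "gap_to_runner_up = float(ranked[-1] - ranked[-2])\n",
    "print(best_value, round(gap_to_runner_up, 5))\n",
    "```\n",
    "\n",
    "The gap between 30 and 40 is much smaller than the fold-to-fold spread visible in the log above, so the choice between the two is not clear-cut."
   ]
  },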
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.行采样参数 sub_samples/bagging_fraction 和列采样参数 sub_feature/feature_fraction/colsample_bytree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "列采样colsample_bytree,每棵树的特征子集占比，设置在0~1之间，可以加快训练速度，避免过拟合\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "行采样sub_samples,不进行重采样的随机选取部分样本数据，此外需要设置参数bagging_freq来作为采样的频率，即多少轮迭代做一次bagging"
   ]
  },
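  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A conceptual sketch of the two sampling steps described above, using NumPy only. This is not LightGBM's internal implementation; the helper name `sample_rows_and_cols` and the toy sizes (10 rows, 6 features) are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def sample_rows_and_cols(n_rows, n_cols, subsample, colsample_bytree, rng):\n",
    "    # Row sampling: pick a fraction of the row indices WITHOUT replacement\n",
    "    n_keep_rows = max(1, int(round(n_rows * subsample)))\n",
    "    rows = rng.choice(n_rows, size=n_keep_rows, replace=False)\n",
    "    # Column sampling: pick a fraction of the feature indices for this tree\n",
    "    n_keep_cols = max(1, int(round(n_cols * colsample_bytree)))\n",
    "    cols = rng.choice(n_cols, size=n_keep_cols, replace=False)\n",
    "    return np.sort(rows), np.sort(cols)\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "# With a sampling frequency of 1, the rows are re-drawn at every iteration\n",
    "for iteration in range(3):\n",
    "    rows, cols = sample_rows_and_cols(10, 6, 0.7, 0.5, rng)\n",
    "    print(iteration, rows, cols)\n",
    "```\n",
    "\n",
    "In LightGBM, row sampling only takes effect when `bagging_freq` (alias `subsample_freq`) is greater than 0, which is why the cells here set `bagging_freq` to 1."
   ]
  },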
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 25 candidates, totalling 75 fits\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.476, total=  21.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.1s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.474, total=  21.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   42.4s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.482, total=  20.5s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.0min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.474, total=  22.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.4min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.474, total=  22.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.479, total=  21.9s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.472, total=  23.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.472, total=  23.3s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.477, total=  23.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.474, total=  24.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.471, total=  24.3s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.477, total=  24.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.474, total=  25.4s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.472, total=  25.3s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.479, total=  24.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.477, total=  21.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.476, total=  22.1s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.483, total=  21.9s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.474, total=  23.4s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.474, total=  23.1s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.482, total=  23.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.472, total=  27.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.473, total=  25.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.476, total=  24.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.474, total=  25.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.474, total=  25.9s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.479, total=  26.3s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.475, total=  27.2s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.474, total=  27.2s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.480, total=  26.9s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.477, total=  23.7s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.476, total=  23.8s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.481, total=  23.8s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.475, total=  25.2s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.473, total=  25.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.478, total=  25.0s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.474, total=  26.4s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.472, total=  26.6s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.479, total=  26.2s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.474, total=  27.7s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.473, total=  27.7s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.480, total=  27.7s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.474, total=  29.4s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.474, total=  29.1s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.481, total=  29.2s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.478, total=  26.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.477, total=  24.5s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.484, total=  24.5s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.477, total=  26.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.473, total=  25.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.480, total=  25.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.475, total=  27.3s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.473, total=  27.5s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.479, total=  28.6s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.475, total=  29.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.476, total=  31.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.480, total=  29.7s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.476, total=  36.7s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.474, total=  31.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.483, total=  30.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.476, total=  26.2s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.478, total=  26.6s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.483, total=  26.1s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.477, total=  29.7s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.476, total=  31.3s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.480, total=  28.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.473, total=  30.1s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.475, total=  29.9s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.481, total=  30.0s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.475, total=  31.6s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.475, total=  31.9s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.479, total=  31.8s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.477, total=  33.7s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.477, total=  34.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.484, total=  36.5s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed: 33.3min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=1.0,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=30,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=313,...\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=1.0,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'colsample_bytree': array([0.5, 0.6, 0.7, 0.8, 0.9]),\n",
       "                         'subsample': array([0.5, 0.6, 0.7, 0.8, 0.9])},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,#选用已经调好的叶子数目\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples':30,#添加调好的子节点-\n",
    "          'n_estimators':313,#选用已经调好的迭代次数\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          #'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False,**params)#**表示以字典的形式导入\n",
    "tuned_params = dict(subsample=np.arange(0.5,1.0,0.1),colsample_bytree=np.arange(0.5,1.0,0.1))\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=kfold,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'colsample_bytree': 0.5, 'subsample': 0.7} -0.4734486461965822\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 正则化参数lambda_l1(reg_alpha), lambda_l2(reg_lambda)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 9 candidates, totalling 27 fits\n",
      "[CV] reg_alpha=0.1, reg_lambda=0.1 ...................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...... reg_alpha=0.1, reg_lambda=0.1, score=-0.472, total=  23.6s\n",
      "[CV] reg_alpha=0.1, reg_lambda=0.1 ...................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.5s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...... reg_alpha=0.1, reg_lambda=0.1, score=-0.472, total=  23.5s\n",
      "[CV] reg_alpha=0.1, reg_lambda=0.1 ...................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   47.0s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...... reg_alpha=0.1, reg_lambda=0.1, score=-0.475, total=  24.7s\n",
      "[CV] reg_alpha=0.1, reg_lambda=0.5 ...................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.2min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...... reg_alpha=0.1, reg_lambda=0.5, score=-0.471, total=  23.7s\n",
      "[CV] reg_alpha=0.1, reg_lambda=0.5 ...................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.6min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ...... reg_alpha=0.1, reg_lambda=0.5, score=-0.469, total=  27.2s\n",
      "[CV] reg_alpha=0.1, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.1, reg_lambda=0.5, score=-0.474, total=  25.4s\n",
      "[CV] reg_alpha=0.1, reg_lambda=1 .....................................\n",
      "[CV] ........ reg_alpha=0.1, reg_lambda=1, score=-0.470, total=  23.2s\n",
      "[CV] reg_alpha=0.1, reg_lambda=1 .....................................\n",
      "[CV] ........ reg_alpha=0.1, reg_lambda=1, score=-0.469, total=  23.7s\n",
      "[CV] reg_alpha=0.1, reg_lambda=1 .....................................\n",
      "[CV] ........ reg_alpha=0.1, reg_lambda=1, score=-0.474, total=  22.2s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.1 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.1, score=-0.471, total=  22.4s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.1 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.1, score=-0.469, total=  23.6s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.1 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.1, score=-0.471, total=  22.7s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.470, total=  22.8s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.469, total=  23.0s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.473, total=  24.6s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=1, score=-0.470, total=  25.9s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=1, score=-0.468, total=  23.3s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=1, score=-0.473, total=  22.6s\n",
      "[CV] reg_alpha=1, reg_lambda=0.1 .....................................\n",
      "[CV] ........ reg_alpha=1, reg_lambda=0.1, score=-0.468, total=  24.0s\n",
      "[CV] reg_alpha=1, reg_lambda=0.1 .....................................\n",
      "[CV] ........ reg_alpha=1, reg_lambda=0.1, score=-0.469, total=  25.6s\n",
      "[CV] reg_alpha=1, reg_lambda=0.1 .....................................\n",
      "[CV] ........ reg_alpha=1, reg_lambda=0.1, score=-0.474, total=  30.8s\n",
      "[CV] reg_alpha=1, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=1, reg_lambda=0.5, score=-0.470, total=  30.8s\n",
      "[CV] reg_alpha=1, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=1, reg_lambda=0.5, score=-0.469, total=  22.7s\n",
      "[CV] reg_alpha=1, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=1, reg_lambda=0.5, score=-0.473, total=  22.9s\n",
      "[CV] reg_alpha=1, reg_lambda=1 .......................................\n",
      "[CV] .......... reg_alpha=1, reg_lambda=1, score=-0.471, total=  24.4s\n",
      "[CV] reg_alpha=1, reg_lambda=1 .......................................\n",
      "[CV] .......... reg_alpha=1, reg_lambda=1, score=-0.469, total=  23.5s\n",
      "[CV] reg_alpha=1, reg_lambda=1 .......................................\n",
      "[CV] .......... reg_alpha=1, reg_lambda=1, score=-0.474, total=  23.0s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed: 10.9min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.5,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=30,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=313,\n",
       "                                      n_jobs=4, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'reg_alpha': [0.1, 0.5, 1],\n",
       "                         'reg_lambda': [0.1, 0.5, 1]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,#选用已经调好的叶子数目\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples':30,#添加调好的子节点-\n",
    "          'n_estimators':313,#选用已经调好的迭代次数\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False,**params)#**表示以字典的形式导入\n",
    "tuned_params = dict(reg_alpha=[0.1,0.5,1],reg_lambda=[0.1,0.5,1])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=kfold,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reg_alpha': 0.5, 'reg_lambda': 0.1} -0.47030951438090207\n",
      "[-0.47310845 -0.4715962  -0.47096659 -0.47030951 -0.47071347 -0.4705304\n",
      " -0.47035312 -0.47050576 -0.47151166]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best reg_lambda (0.1) sits at the lower edge of the grid, so it seems worth trying smaller values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 5 candidates, totalling 15 fits\n",
      "[CV] reg_lambda=0.01 .................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .................... reg_lambda=0.01, score=-0.469, total=  22.0s\n",
      "[CV] reg_lambda=0.01 .................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.9s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .................... reg_lambda=0.01, score=-0.470, total=  23.8s\n",
      "[CV] reg_lambda=0.01 .................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   45.7s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .................... reg_lambda=0.01, score=-0.473, total=  22.7s\n",
      "[CV] reg_lambda=0.03 .................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.1min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .................... reg_lambda=0.03, score=-0.470, total=  24.9s\n",
      "[CV] reg_lambda=0.03 .................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.6min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .................... reg_lambda=0.03, score=-0.469, total=  23.5s\n",
      "[CV] reg_lambda=0.03 .................................................\n",
      "[CV] .................... reg_lambda=0.03, score=-0.474, total=  24.8s\n",
      "[CV] reg_lambda=0.05 .................................................\n",
      "[CV] .................... reg_lambda=0.05, score=-0.471, total=  23.8s\n",
      "[CV] reg_lambda=0.05 .................................................\n",
      "[CV] .................... reg_lambda=0.05, score=-0.469, total=  23.9s\n",
      "[CV] reg_lambda=0.05 .................................................\n",
      "[CV] .................... reg_lambda=0.05, score=-0.474, total=  23.1s\n",
      "[CV] reg_lambda=0.07 .................................................\n",
      "[CV] .................... reg_lambda=0.07, score=-0.471, total=  24.1s\n",
      "[CV] reg_lambda=0.07 .................................................\n",
      "[CV] .................... reg_lambda=0.07, score=-0.468, total=  26.3s\n",
      "[CV] reg_lambda=0.07 .................................................\n",
      "[CV] .................... reg_lambda=0.07, score=-0.474, total=  25.1s\n",
      "[CV] reg_lambda=0.09 .................................................\n",
      "[CV] .................... reg_lambda=0.09, score=-0.471, total=  24.5s\n",
      "[CV] reg_lambda=0.09 .................................................\n",
      "[CV] .................... reg_lambda=0.09, score=-0.468, total=  24.6s\n",
      "[CV] reg_lambda=0.09 .................................................\n",
      "[CV] .................... reg_lambda=0.09, score=-0.473, total=  26.0s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  6.0min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.5,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=30,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=313,\n",
       "                                      n_jobs=4, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.5, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'reg_lambda': [0.01, 0.03, 0.05, 0.07, 0.09]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,  # leaf count chosen in earlier tuning\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples': 30,  # minimum samples per leaf chosen in earlier tuning\n",
    "          'n_estimators': 313,  # boosting rounds chosen in earlier tuning\n",
    "          'max_bin': 127,  # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'reg_alpha': 0.5,  # alpha is fixed; only lambda is tuned here\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(reg_lambda=[0.01,0.03,0.05,0.07,0.09])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=kfold,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reg_lambda': 0.09} -0.47065404958882034\n",
      "[-0.47071087 -0.47101872 -0.4709814  -0.4708131  -0.47065405]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
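  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is no clear improvement (the best mean score here is -0.47065 at reg_lambda=0.09, versus -0.47031 at 0.1 in the previous search), so keep reg_lambda=0.1."
   ]
  },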
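  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now that regularization has been added (which mitigates overfitting to some extent), increase max_bin moderately and check whether performance improves"
   ]
  },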
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 3 candidates, totalling 9 fits\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.471, total=  24.4s\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   24.3s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.469, total=  24.5s\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   48.8s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.474, total=  24.2s\n",
      "[CV] max_bin=180 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.2min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=180, score=-0.471, total=  27.4s\n",
      "[CV] max_bin=180 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.7min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=180, score=-0.468, total=  26.0s\n",
      "[CV] max_bin=180 .....................................................\n",
      "[CV] ........................ max_bin=180, score=-0.474, total=  26.1s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.470, total=  25.2s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.469, total=  26.5s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.473, total=  28.4s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  3.9min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.5,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_depth=7,\n",
       "                                      min_child_samples=30,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=313,\n",
       "                                      n_jobs=4, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.5, reg_lambda=0.1,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None, param_grid={'max_bin': [150, 180, 200]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,  # leaf count chosen in earlier tuning\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples': 30,  # minimum samples per leaf chosen in earlier tuning\n",
    "          'n_estimators': 313,  # boosting rounds chosen in earlier tuning\n",
    "          # 'max_bin': 127,  # left out here; larger values are searched below\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'reg_alpha': 0.5,\n",
    "          'reg_lambda': 0.1\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(max_bin=[150,180,200])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=kfold,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'max_bin': 200} -0.4706915107290532\n",
      "[-0.47122662 -0.47096882 -0.47069151]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect is not significant (the best mean score is -0.47069 at max_bin=200), so keep the initial value of 127."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Increase the number of cross-validation folds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 10 folds for each of 1 candidates, totalling 10 fits\n",
      "[CV] max_bin=127 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=127, score=-0.472, total=  29.4s\n",
      "[CV] max_bin=127 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   29.3s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=127, score=-0.456, total=  30.6s\n",
      "[CV] max_bin=127 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   59.9s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=127, score=-0.454, total=  31.2s\n",
      "[CV] max_bin=127 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.5min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=127, score=-0.474, total=  28.5s\n",
      "[CV] max_bin=127 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.0min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=127, score=-0.454, total=  29.0s\n",
      "[CV] max_bin=127 .....................................................\n",
      "[CV] ........................ max_bin=127, score=-0.463, total=  29.2s\n",
      "[CV] max_bin=127 .....................................................\n",
      "[CV] ........................ max_bin=127, score=-0.459, total=  31.9s\n",
      "[CV] max_bin=127 .....................................................\n",
      "[CV] ........................ max_bin=127, score=-0.435, total=  30.1s\n",
      "[CV] max_bin=127 .....................................................\n",
      "[CV] ........................ max_bin=127, score=-0.455, total=  29.3s\n",
      "[CV] max_bin=127 .....................................................\n",
      "[CV] ........................ max_bin=127, score=-0.453, total=  29.2s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  5.0min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=10, error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.5,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_depth=7,\n",
       "                                      min_child_samples=30,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=313,\n",
       "                                      n_jobs=4, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.5, reg_lambda=0.1,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None, param_grid={'max_bin': [127]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,  # leaf count chosen in earlier tuning\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples': 30,  # minimum samples per leaf chosen in earlier tuning\n",
    "          'n_estimators': 313,  # boosting rounds chosen in earlier tuning\n",
    "          'max_bin': 127,\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'reg_alpha': 0.5,\n",
    "          'reg_lambda': 0.1\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(max_bin=[127])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=10,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-0.4574683976189454\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lower the learning rate, allow a longer training time, and re-tune n_estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[10]\tcv_agg's multi_logloss: 1.52611 + 0.0021307\n",
      "[20]\tcv_agg's multi_logloss: 1.28532 + 0.00345903\n",
      "[30]\tcv_agg's multi_logloss: 1.12117 + 0.00434416\n",
      "[40]\tcv_agg's multi_logloss: 1.00141 + 0.00465642\n",
      "[50]\tcv_agg's multi_logloss: 0.91084 + 0.00506885\n",
      "[60]\tcv_agg's multi_logloss: 0.840819 + 0.00534364\n",
      "[70]\tcv_agg's multi_logloss: 0.784849 + 0.0056312\n",
      "[80]\tcv_agg's multi_logloss: 0.740825 + 0.00572434\n",
      "[90]\tcv_agg's multi_logloss: 0.705165 + 0.0057008\n",
      "[100]\tcv_agg's multi_logloss: 0.676266 + 0.00559602\n",
      "[110]\tcv_agg's multi_logloss: 0.652377 + 0.00565266\n",
      "[120]\tcv_agg's multi_logloss: 0.633011 + 0.00579102\n",
      "[130]\tcv_agg's multi_logloss: 0.61635 + 0.00576045\n",
      "[140]\tcv_agg's multi_logloss: 0.602737 + 0.00588275\n",
      "[150]\tcv_agg's multi_logloss: 0.590989 + 0.00587987\n",
      "[160]\tcv_agg's multi_logloss: 0.58088 + 0.00586232\n",
      "[170]\tcv_agg's multi_logloss: 0.57224 + 0.005922\n",
      "[180]\tcv_agg's multi_logloss: 0.564574 + 0.0059036\n",
      "[190]\tcv_agg's multi_logloss: 0.557762 + 0.00594724\n",
      "[200]\tcv_agg's multi_logloss: 0.551742 + 0.00591644\n",
      "[210]\tcv_agg's multi_logloss: 0.546356 + 0.00589273\n",
      "[220]\tcv_agg's multi_logloss: 0.541357 + 0.00583268\n",
      "[230]\tcv_agg's multi_logloss: 0.536786 + 0.00586584\n",
      "[240]\tcv_agg's multi_logloss: 0.532691 + 0.00584803\n",
      "[250]\tcv_agg's multi_logloss: 0.528934 + 0.00577741\n",
      "[260]\tcv_agg's multi_logloss: 0.525464 + 0.00578816\n",
      "[270]\tcv_agg's multi_logloss: 0.522219 + 0.00581414\n",
      "[280]\tcv_agg's multi_logloss: 0.519189 + 0.00577911\n",
      "[290]\tcv_agg's multi_logloss: 0.516417 + 0.00571525\n",
      "[300]\tcv_agg's multi_logloss: 0.513777 + 0.0056651\n",
      "[310]\tcv_agg's multi_logloss: 0.511275 + 0.00561827\n",
      "[320]\tcv_agg's multi_logloss: 0.508876 + 0.00561739\n",
      "[330]\tcv_agg's multi_logloss: 0.5066 + 0.00561032\n",
      "[340]\tcv_agg's multi_logloss: 0.504464 + 0.00558521\n",
      "[350]\tcv_agg's multi_logloss: 0.502478 + 0.00558938\n",
      "[360]\tcv_agg's multi_logloss: 0.500577 + 0.00550912\n",
      "[370]\tcv_agg's multi_logloss: 0.498714 + 0.00555481\n",
      "[380]\tcv_agg's multi_logloss: 0.496981 + 0.00559058\n",
      "[390]\tcv_agg's multi_logloss: 0.495371 + 0.00563349\n",
      "[400]\tcv_agg's multi_logloss: 0.493726 + 0.00562311\n",
      "[410]\tcv_agg's multi_logloss: 0.492201 + 0.0056302\n",
      "[420]\tcv_agg's multi_logloss: 0.490718 + 0.00560738\n",
      "[430]\tcv_agg's multi_logloss: 0.489256 + 0.00558006\n",
      "[440]\tcv_agg's multi_logloss: 0.487946 + 0.00558624\n",
      "[450]\tcv_agg's multi_logloss: 0.486639 + 0.00564293\n",
      "[460]\tcv_agg's multi_logloss: 0.485367 + 0.00571263\n",
      "[470]\tcv_agg's multi_logloss: 0.484125 + 0.00574112\n",
      "[480]\tcv_agg's multi_logloss: 0.482916 + 0.00577734\n",
      "[490]\tcv_agg's multi_logloss: 0.481814 + 0.00578885\n",
      "[500]\tcv_agg's multi_logloss: 0.48071 + 0.00583761\n",
      "[510]\tcv_agg's multi_logloss: 0.479682 + 0.00586418\n",
      "[520]\tcv_agg's multi_logloss: 0.478628 + 0.00591178\n",
      "[530]\tcv_agg's multi_logloss: 0.477592 + 0.00593082\n",
      "[540]\tcv_agg's multi_logloss: 0.476714 + 0.00590318\n",
      "[550]\tcv_agg's multi_logloss: 0.475757 + 0.00600137\n",
      "[560]\tcv_agg's multi_logloss: 0.474843 + 0.00602546\n",
      "[570]\tcv_agg's multi_logloss: 0.473991 + 0.00605419\n",
      "[580]\tcv_agg's multi_logloss: 0.473147 + 0.006082\n",
      "[590]\tcv_agg's multi_logloss: 0.472325 + 0.00608549\n",
      "[600]\tcv_agg's multi_logloss: 0.471483 + 0.00611207\n",
      "[610]\tcv_agg's multi_logloss: 0.470735 + 0.00611631\n",
      "[620]\tcv_agg's multi_logloss: 0.470061 + 0.00614426\n",
      "[630]\tcv_agg's multi_logloss: 0.469325 + 0.00614061\n",
      "[640]\tcv_agg's multi_logloss: 0.468616 + 0.00614422\n",
      "[650]\tcv_agg's multi_logloss: 0.467898 + 0.00617373\n",
      "[660]\tcv_agg's multi_logloss: 0.467234 + 0.00619956\n",
      "[670]\tcv_agg's multi_logloss: 0.466626 + 0.00630162\n",
      "[680]\tcv_agg's multi_logloss: 0.466017 + 0.00633804\n",
      "[690]\tcv_agg's multi_logloss: 0.465385 + 0.00635427\n",
      "[700]\tcv_agg's multi_logloss: 0.464773 + 0.00639753\n",
      "[710]\tcv_agg's multi_logloss: 0.4642 + 0.00640288\n",
      "[720]\tcv_agg's multi_logloss: 0.463657 + 0.00639882\n",
      "[730]\tcv_agg's multi_logloss: 0.463146 + 0.00640738\n",
      "[740]\tcv_agg's multi_logloss: 0.462612 + 0.00644907\n",
      "[750]\tcv_agg's multi_logloss: 0.462103 + 0.00647798\n",
      "[760]\tcv_agg's multi_logloss: 0.461593 + 0.00647604\n",
      "[770]\tcv_agg's multi_logloss: 0.461091 + 0.00647899\n",
      "[780]\tcv_agg's multi_logloss: 0.460572 + 0.0065005\n",
      "[790]\tcv_agg's multi_logloss: 0.460068 + 0.00652423\n",
      "[800]\tcv_agg's multi_logloss: 0.459625 + 0.00655201\n",
      "[810]\tcv_agg's multi_logloss: 0.459201 + 0.00659114\n",
      "[820]\tcv_agg's multi_logloss: 0.458751 + 0.00659346\n",
      "[830]\tcv_agg's multi_logloss: 0.458363 + 0.00659483\n",
      "[840]\tcv_agg's multi_logloss: 0.457974 + 0.00664178\n",
      "[850]\tcv_agg's multi_logloss: 0.457572 + 0.00664032\n",
      "[860]\tcv_agg's multi_logloss: 0.457243 + 0.00662612\n",
      "[870]\tcv_agg's multi_logloss: 0.456875 + 0.00664792\n",
      "[880]\tcv_agg's multi_logloss: 0.45653 + 0.00663599\n",
      "[890]\tcv_agg's multi_logloss: 0.4562 + 0.00666741\n",
      "[900]\tcv_agg's multi_logloss: 0.455903 + 0.00670111\n",
      "[910]\tcv_agg's multi_logloss: 0.455577 + 0.00674321\n",
      "[920]\tcv_agg's multi_logloss: 0.455239 + 0.00671308\n",
      "[930]\tcv_agg's multi_logloss: 0.454915 + 0.00675554\n",
      "[940]\tcv_agg's multi_logloss: 0.454605 + 0.00680988\n",
      "[950]\tcv_agg's multi_logloss: 0.454338 + 0.00678484\n",
      "[960]\tcv_agg's multi_logloss: 0.45407 + 0.00685173\n",
      "[970]\tcv_agg's multi_logloss: 0.453772 + 0.00688909\n",
      "[980]\tcv_agg's multi_logloss: 0.453521 + 0.00694276\n",
      "[990]\tcv_agg's multi_logloss: 0.453223 + 0.00693646\n",
      "[1000]\tcv_agg's multi_logloss: 0.452992 + 0.00696495\n",
      "[1010]\tcv_agg's multi_logloss: 0.452734 + 0.00699427\n",
      "[1020]\tcv_agg's multi_logloss: 0.452496 + 0.00696129\n",
      "[1030]\tcv_agg's multi_logloss: 0.452255 + 0.00700478\n",
      "[1040]\tcv_agg's multi_logloss: 0.452062 + 0.00700713\n",
      "[1050]\tcv_agg's multi_logloss: 0.451833 + 0.00699299\n",
      "[1060]\tcv_agg's multi_logloss: 0.451628 + 0.00701986\n",
      "[1070]\tcv_agg's multi_logloss: 0.451445 + 0.0070753\n",
      "[1080]\tcv_agg's multi_logloss: 0.451238 + 0.00707286\n",
      "[1090]\tcv_agg's multi_logloss: 0.451049 + 0.00708824\n",
      "[1100]\tcv_agg's multi_logloss: 0.45084 + 0.00709545\n",
      "[1110]\tcv_agg's multi_logloss: 0.450631 + 0.00710729\n",
      "[1120]\tcv_agg's multi_logloss: 0.450441 + 0.00713079\n",
      "[1130]\tcv_agg's multi_logloss: 0.450318 + 0.00717577\n",
      "[1140]\tcv_agg's multi_logloss: 0.450187 + 0.00717252\n",
      "[1150]\tcv_agg's multi_logloss: 0.450045 + 0.00720432\n",
      "[1160]\tcv_agg's multi_logloss: 0.449881 + 0.00720545\n",
      "[1170]\tcv_agg's multi_logloss: 0.449717 + 0.00725591\n",
      "[1180]\tcv_agg's multi_logloss: 0.449563 + 0.00726472\n",
      "[1190]\tcv_agg's multi_logloss: 0.449411 + 0.00729166\n",
      "[1200]\tcv_agg's multi_logloss: 0.449295 + 0.00733625\n",
      "[1210]\tcv_agg's multi_logloss: 0.449186 + 0.00734396\n",
      "[1220]\tcv_agg's multi_logloss: 0.449086 + 0.00738358\n",
      "[1230]\tcv_agg's multi_logloss: 0.448978 + 0.00737123\n",
      "[1240]\tcv_agg's multi_logloss: 0.448901 + 0.00739836\n",
      "[1250]\tcv_agg's multi_logloss: 0.448798 + 0.0073854\n",
      "[1260]\tcv_agg's multi_logloss: 0.448695 + 0.00737388\n",
      "[1270]\tcv_agg's multi_logloss: 0.448611 + 0.00735831\n",
      "[1280]\tcv_agg's multi_logloss: 0.448503 + 0.00740986\n",
      "[1290]\tcv_agg's multi_logloss: 0.448463 + 0.00742382\n",
      "[1300]\tcv_agg's multi_logloss: 0.448369 + 0.00741744\n",
      "[1310]\tcv_agg's multi_logloss: 0.448261 + 0.00742613\n",
      "[1320]\tcv_agg's multi_logloss: 0.448188 + 0.00746043\n",
      "[1330]\tcv_agg's multi_logloss: 0.448098 + 0.0074876\n",
      "[1340]\tcv_agg's multi_logloss: 0.44804 + 0.00753925\n",
      "[1350]\tcv_agg's multi_logloss: 0.447936 + 0.00754623\n",
      "[1360]\tcv_agg's multi_logloss: 0.447855 + 0.00753421\n",
      "[1370]\tcv_agg's multi_logloss: 0.447781 + 0.00755144\n",
      "[1380]\tcv_agg's multi_logloss: 0.447717 + 0.00756696\n",
      "[1390]\tcv_agg's multi_logloss: 0.447663 + 0.00757274\n",
      "[1400]\tcv_agg's multi_logloss: 0.447638 + 0.00761054\n",
      "[1410]\tcv_agg's multi_logloss: 0.447602 + 0.00765735\n",
      "[1420]\tcv_agg's multi_logloss: 0.447536 + 0.00766664\n",
      "[1430]\tcv_agg's multi_logloss: 0.447472 + 0.00767463\n",
      "[1440]\tcv_agg's multi_logloss: 0.447413 + 0.0077083\n",
      "[1450]\tcv_agg's multi_logloss: 0.447373 + 0.0077471\n",
      "[1460]\tcv_agg's multi_logloss: 0.447365 + 0.00774518\n",
      "[1470]\tcv_agg's multi_logloss: 0.44734 + 0.00780201\n",
      "[1480]\tcv_agg's multi_logloss: 0.447306 + 0.00779384\n",
      "[1490]\tcv_agg's multi_logloss: 0.447312 + 0.00781426\n",
      "[1500]\tcv_agg's multi_logloss: 0.447301 + 0.00785129\n",
      "[1510]\tcv_agg's multi_logloss: 0.447341 + 0.00784204\n",
      "[1520]\tcv_agg's multi_logloss: 0.447349 + 0.00784269\n"
     ]
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.03,  # lowered learning rate\n",
    "          'num_leaves': 80,  # leaf count chosen in earlier tuning\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples': 30,  # minimum samples per leaf chosen in earlier tuning\n",
    "          # 'n_estimators': 313,  # re-tuned below via lgbm.cv with early stopping\n",
    "          'max_bin': 127,\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'reg_alpha': 0.5,\n",
    "          'reg_lambda': 0.1,\n",
    "          'num_class': 9  # number of classes for the multiclass objective\n",
    "         }\n",
    "\n",
    "lgbm_data = lgbm.Dataset(X, y)\n",
    "# 10-fold CV with early stopping to find the number of boosting rounds\n",
    "cv_result = lgbm.cv(params, lgbm_data, nfold=10, metrics='multi_logloss',\n",
    "                    num_boost_round=10000, early_stopping_rounds=20,\n",
    "                    seed=3, verbose_eval=10)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.44730137684258936\n",
      "1500\n"
     ]
    }
   ],
   "source": [
    "# Lowest mean CV log loss, and the number of boosting rounds kept after early stopping\n",
    "print(min(cv_result['multi_logloss-mean']))\n",
    "print(len(cv_result['multi_logloss-mean']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tuning is essentially done: fix the parameters and train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LGBMClassifier(bagging_freq=1, boosting_type='gbdt', class_weight=None,\n",
       "               colsample_bytree=0.5, importance_type='split',\n",
       "               learning_rate=0.03, max_bin=127, max_depth=7,\n",
       "               min_child_samples=30, min_child_weight=0.001, min_split_gain=0.0,\n",
       "               n_estimators=1500, n_jobs=4, num_class=9, num_leaves=80,\n",
       "               objective='multiclass', random_state=None, reg_alpha=0.5,\n",
       "               reg_lambda=0.1, silent=False, subsample=0.7,\n",
       "               subsample_for_bin=200000, subsample_freq=0)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.03,#将学习率调低\n",
    "          'num_leaves': 80,#选用已经调好的叶子数目\n",
    "          'max_depth': 7,\n",
    "          'min_child_samples':30,#添加调好的子节点-\n",
    "          'n_estimators':1500,#选用已经调好的迭代次数\n",
    "          'max_bin': 127, \n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'reg_alpha':0.5,\n",
    "          'reg_lambda':0.1,\n",
    "          'num_class':9#多分类问题 设置分类样本类别\n",
    "         }\n",
    "lgc=LGBMClassifier(silent=False,**params)\n",
    "lgc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle as pc\n",
    "pc.dump(lgc,open('lgbm_gbdt.pkl',\"wb\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 读取模型和测试数据，对模型进行测试"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>feat_1</th>\n",
       "      <th>feat_2</th>\n",
       "      <th>feat_3</th>\n",
       "      <th>feat_4</th>\n",
       "      <th>feat_5</th>\n",
       "      <th>feat_6</th>\n",
       "      <th>feat_7</th>\n",
       "      <th>feat_8</th>\n",
       "      <th>feat_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feat_84</th>\n",
       "      <th>feat_85</th>\n",
       "      <th>feat_86</th>\n",
       "      <th>feat_87</th>\n",
       "      <th>feat_88</th>\n",
       "      <th>feat_89</th>\n",
       "      <th>feat_90</th>\n",
       "      <th>feat_91</th>\n",
       "      <th>feat_92</th>\n",
       "      <th>feat_93</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>14</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 94 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id  feat_1  feat_2  feat_3  feat_4  feat_5  feat_6  feat_7  feat_8  feat_9  \\\n",
       "0   1       0       0       0       0       0       0       0       0       0   \n",
       "1   2       2       2      14      16       0       0       0       0       0   \n",
       "2   3       0       1      12       1       0       0       0       0       0   \n",
       "3   4       0       0       0       1       0       0       0       0       0   \n",
       "4   5       1       0       0       1       0       0       1       2       0   \n",
       "\n",
       "    ...     feat_84  feat_85  feat_86  feat_87  feat_88  feat_89  feat_90  \\\n",
       "0   ...           0        0       11        1       20        0        0   \n",
       "1   ...           0        0        0        0        0        4        0   \n",
       "2   ...           0        0        0        0        2        0        0   \n",
       "3   ...           0        3        1        0        0        0        0   \n",
       "4   ...           0        0        0        0        0        0        0   \n",
       "\n",
       "   feat_91  feat_92  feat_93  \n",
       "0        0        0        0  \n",
       "1        0        2        0  \n",
       "2        0        0        1  \n",
       "3        0        0        0  \n",
       "4        9        0        0  \n",
       "\n",
       "[5 rows x 94 columns]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"D:\\\\python\\\\csdn data\\\\otto\\\\otto_test.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgc = pc.load(open(\"lgbm_gbdt.pkl\",\"rb\"))\n",
    "tfidf = pc.load(open('tfidf.pkl','rb'))\n",
    "mms_tfidf = pc.load(open('mms_tfidf.pkl','rb'))\n",
    "mms_log = pc.load(open('mms_log.pkl','rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对测试数据进行特征工程"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取index和columns\n",
    "index = df['id']\n",
    "data = df.drop(['id'],axis=1)\n",
    "columns_name = data.columns\n",
    "\n",
    "# 处理数据\n",
    "df_log = np.log1p(data)\n",
    "df_tfidf = tfidf.fit_transform(data).toarray()\n",
    "df_log_mms = mms_log.fit_transform(df_log)\n",
    "df_tfidf_mms = mms_tfidf.fit_transform(df_tfidf)\n",
    "\n",
    "# 拼接数据\n",
    "X = np.hstack((df_log_mms,df_tfidf_mms))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 保存处理好的数据\n",
    "ttt = pd.DataFrame(data=pd.concat([index,pd.DataFrame(data=X,columns=(columns_name+'_log').append(columns_name+'_tfidf'))],axis=1))\n",
    "ttt.to_csv('Otto_test_final_data.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 预测数据\n",
    "result = lgc.predict_proba(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 生成提交数据\n",
    "col_name = ['Class_'+str(i) for i in range(1,10)]\n",
    "data = pd.DataFrame(data=result,columns=col_name)\n",
    "final = pd.concat([index,data],axis=1)\n",
    "# 保存结果\n",
    "final.to_csv('lgbm_gbdt_result.csv',index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 提交结果 得分0.4416 排名519左右"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. 采用tfidf特征，采用LightGBM（goss）完成Otto商品分类，尽可能将超参数调到最优，并用该模型对测试数据进行测试，提交Kaggle网站，提交排名"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 直接读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>feat_1</th>\n",
       "      <th>feat_2</th>\n",
       "      <th>feat_3</th>\n",
       "      <th>feat_4</th>\n",
       "      <th>feat_5</th>\n",
       "      <th>feat_6</th>\n",
       "      <th>feat_7</th>\n",
       "      <th>feat_8</th>\n",
       "      <th>feat_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feat_85.1</th>\n",
       "      <th>feat_86.1</th>\n",
       "      <th>feat_87.1</th>\n",
       "      <th>feat_88.1</th>\n",
       "      <th>feat_89.1</th>\n",
       "      <th>feat_90.1</th>\n",
       "      <th>feat_91.1</th>\n",
       "      <th>feat_92.1</th>\n",
       "      <th>feat_93.1</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.075886</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0.167949</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.162608</td>\n",
       "      <td>0.649561</td>\n",
       "      <td>0.289065</td>\n",
       "      <td>0.489076</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.008244</td>\n",
       "      <td>0.022456</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.124622</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.145988</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Class_1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 188 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id    feat_1  feat_2  feat_3    feat_4    feat_5    feat_6    feat_7  \\\n",
       "0   1  0.167949     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "1   2  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "2   3  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "3   4  0.167949     0.0     0.0  0.162608  0.649561  0.289065  0.489076   \n",
       "4   5  0.000000     0.0     0.0  0.000000  0.000000  0.000000  0.000000   \n",
       "\n",
       "     feat_8  feat_9   ...     feat_85.1  feat_86.1  feat_87.1  feat_88.1  \\\n",
       "0  0.000000     0.0   ...      0.075886   0.000000   0.000000        0.0   \n",
       "1  0.159571     0.0   ...      0.000000   0.000000   0.000000        0.0   \n",
       "2  0.159571     0.0   ...      0.000000   0.000000   0.000000        0.0   \n",
       "3  0.000000     0.0   ...      0.000000   0.008244   0.022456        0.0   \n",
       "4  0.000000     0.0   ...      0.124622   0.000000   0.000000        0.0   \n",
       "\n",
       "   feat_89.1  feat_90.1  feat_91.1  feat_92.1  feat_93.1   target  \n",
       "0        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "1        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "2        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "3        0.0   0.000000        0.0        0.0        0.0  Class_1  \n",
       "4        0.0   0.145988        0.0        0.0        0.0  Class_1  \n",
       "\n",
       "[5 rows x 188 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"Otto_train_log_tfidf2.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 分离数据\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "X = df.drop(['id','target'],axis=1)\n",
    "y = df['target']\n",
    "y = LabelEncoder().fit_transform(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 先通过初始值搜索一下较优遍历次数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[10]\tcv_agg's multi_logloss: 1.02686 + 0.00306731\n",
      "[20]\tcv_agg's multi_logloss: 0.775627 + 0.00226306\n",
      "[30]\tcv_agg's multi_logloss: 0.664676 + 0.00188421\n",
      "[40]\tcv_agg's multi_logloss: 0.60819 + 0.00183785\n",
      "[50]\tcv_agg's multi_logloss: 0.575625 + 0.00152395\n",
      "[60]\tcv_agg's multi_logloss: 0.554635 + 0.00174677\n",
      "[70]\tcv_agg's multi_logloss: 0.540595 + 0.00198809\n",
      "[80]\tcv_agg's multi_logloss: 0.530184 + 0.00186908\n",
      "[90]\tcv_agg's multi_logloss: 0.522135 + 0.00178978\n",
      "[100]\tcv_agg's multi_logloss: 0.515365 + 0.00151297\n",
      "[110]\tcv_agg's multi_logloss: 0.509988 + 0.00161748\n",
      "[120]\tcv_agg's multi_logloss: 0.505193 + 0.00182393\n",
      "[130]\tcv_agg's multi_logloss: 0.501212 + 0.00184706\n",
      "[140]\tcv_agg's multi_logloss: 0.498111 + 0.00231872\n",
      "[150]\tcv_agg's multi_logloss: 0.495389 + 0.00265716\n",
      "[160]\tcv_agg's multi_logloss: 0.492919 + 0.00270944\n",
      "[170]\tcv_agg's multi_logloss: 0.490693 + 0.00246162\n",
      "[180]\tcv_agg's multi_logloss: 0.488948 + 0.00245621\n",
      "[190]\tcv_agg's multi_logloss: 0.487247 + 0.00240815\n",
      "[200]\tcv_agg's multi_logloss: 0.485856 + 0.00258013\n",
      "[210]\tcv_agg's multi_logloss: 0.484794 + 0.00272698\n",
      "[220]\tcv_agg's multi_logloss: 0.483621 + 0.00289319\n",
      "[230]\tcv_agg's multi_logloss: 0.482906 + 0.00288398\n",
      "[240]\tcv_agg's multi_logloss: 0.482299 + 0.00301611\n",
      "[250]\tcv_agg's multi_logloss: 0.481916 + 0.00330678\n",
      "[260]\tcv_agg's multi_logloss: 0.481535 + 0.00303472\n",
      "[270]\tcv_agg's multi_logloss: 0.481271 + 0.00325026\n",
      "[280]\tcv_agg's multi_logloss: 0.481157 + 0.00338257\n",
      "[290]\tcv_agg's multi_logloss: 0.48117 + 0.00349827\n",
      "[300]\tcv_agg's multi_logloss: 0.481068 + 0.00360119\n"
     ]
    }
   ],
   "source": [
    "#需要调的参数\n",
    "params = {'boosting_type': 'goss',#Gradient-based One-Side Sampling (基于梯度的单侧采样)\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          #'max_bin': 127, \n",
    "          #'subsample': 0.7,\n",
    "          #'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "          'num_class':9#多分类问题 设置分类样本类别\n",
    "         }\n",
    "\n",
    "#读取数据并构造Datase\n",
    "lgbm_data=lgbm.Dataset(X,y)\n",
    "cv_result=lgbm.cv(params,lgbm_data,nfold=5,metrics='multi_logloss',num_boost_round=10000,early_stopping_rounds=20,seed=3,verbose_eval=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 确定参数 继续调 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 24 candidates, totalling 120 fits\n",
      "[CV] max_depth=5, num_leaves=60 ......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ......... max_depth=5, num_leaves=60, score=-0.498, total=  17.9s\n",
      "[CV] max_depth=5, num_leaves=60 ......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.8s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ......... max_depth=5, num_leaves=60, score=-0.502, total=  19.1s\n",
      "[CV] max_depth=5, num_leaves=60 ......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   36.9s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ......... max_depth=5, num_leaves=60, score=-0.491, total=  17.8s\n",
      "[CV] max_depth=5, num_leaves=60 ......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   54.7s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ......... max_depth=5, num_leaves=60, score=-0.480, total=  17.1s\n",
      "[CV] max_depth=5, num_leaves=60 ......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.2min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ......... max_depth=5, num_leaves=60, score=-0.491, total=  19.2s\n",
      "[CV] max_depth=5, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=65, score=-0.498, total=  17.5s\n",
      "[CV] max_depth=5, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=65, score=-0.502, total=  17.6s\n",
      "[CV] max_depth=5, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=65, score=-0.491, total=  17.5s\n",
      "[CV] max_depth=5, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=65, score=-0.480, total=  19.9s\n",
      "[CV] max_depth=5, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=65, score=-0.491, total=  18.6s\n",
      "[CV] max_depth=5, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=70, score=-0.498, total=  20.5s\n",
      "[CV] max_depth=5, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=70, score=-0.502, total=  17.1s\n",
      "[CV] max_depth=5, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=70, score=-0.491, total=  16.9s\n",
      "[CV] max_depth=5, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=70, score=-0.480, total=  17.0s\n",
      "[CV] max_depth=5, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=70, score=-0.491, total=  17.4s\n",
      "[CV] max_depth=5, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=75, score=-0.498, total=  17.2s\n",
      "[CV] max_depth=5, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=75, score=-0.502, total=  17.2s\n",
      "[CV] max_depth=5, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=75, score=-0.491, total=  16.9s\n",
      "[CV] max_depth=5, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=75, score=-0.480, total=  17.1s\n",
      "[CV] max_depth=5, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=75, score=-0.491, total=  17.2s\n",
      "[CV] max_depth=5, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=80, score=-0.498, total=  17.0s\n",
      "[CV] max_depth=5, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=80, score=-0.502, total=  17.1s\n",
      "[CV] max_depth=5, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=80, score=-0.491, total=  17.2s\n",
      "[CV] max_depth=5, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=80, score=-0.480, total=  17.1s\n",
      "[CV] max_depth=5, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=80, score=-0.491, total=  17.3s\n",
      "[CV] max_depth=5, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=85, score=-0.498, total=  17.0s\n",
      "[CV] max_depth=5, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=85, score=-0.502, total=  17.4s\n",
      "[CV] max_depth=5, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=85, score=-0.491, total=  17.0s\n",
      "[CV] max_depth=5, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=85, score=-0.480, total=  17.0s\n",
      "[CV] max_depth=5, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=5, num_leaves=85, score=-0.491, total=  17.4s\n",
      "[CV] max_depth=6, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=60, score=-0.494, total=  22.0s\n",
      "[CV] max_depth=6, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=60, score=-0.491, total=  22.4s\n",
      "[CV] max_depth=6, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=60, score=-0.481, total=  21.9s\n",
      "[CV] max_depth=6, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=60, score=-0.473, total=  22.0s\n",
      "[CV] max_depth=6, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=60, score=-0.481, total=  22.1s\n",
      "[CV] max_depth=6, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=65, score=-0.491, total=  25.3s\n",
      "[CV] max_depth=6, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=65, score=-0.491, total=  21.6s\n",
      "[CV] max_depth=6, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=65, score=-0.481, total=  21.5s\n",
      "[CV] max_depth=6, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=65, score=-0.471, total=  21.4s\n",
      "[CV] max_depth=6, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=65, score=-0.480, total=  21.5s\n",
      "[CV] max_depth=6, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=70, score=-0.491, total=  21.5s\n",
      "[CV] max_depth=6, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=70, score=-0.491, total=  22.6s\n",
      "[CV] max_depth=6, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=70, score=-0.481, total=  24.1s\n",
      "[CV] max_depth=6, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=70, score=-0.471, total=  22.5s\n",
      "[CV] max_depth=6, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=70, score=-0.480, total=  21.6s\n",
      "[CV] max_depth=6, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=75, score=-0.491, total=  21.5s\n",
      "[CV] max_depth=6, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=75, score=-0.491, total=  21.5s\n",
      "[CV] max_depth=6, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=75, score=-0.481, total=  21.2s\n",
      "[CV] max_depth=6, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=75, score=-0.471, total=  21.3s\n",
      "[CV] max_depth=6, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=75, score=-0.480, total=  22.1s\n",
      "[CV] max_depth=6, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=80, score=-0.491, total=  21.4s\n",
      "[CV] max_depth=6, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=80, score=-0.491, total=  21.6s\n",
      "[CV] max_depth=6, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=80, score=-0.481, total=  21.3s\n",
      "[CV] max_depth=6, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=80, score=-0.471, total=  21.1s\n",
      "[CV] max_depth=6, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=80, score=-0.480, total=  21.4s\n",
      "[CV] max_depth=6, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=85, score=-0.491, total=  21.2s\n",
      "[CV] max_depth=6, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=85, score=-0.491, total=  21.3s\n",
      "[CV] max_depth=6, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=85, score=-0.481, total=  21.3s\n",
      "[CV] max_depth=6, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=85, score=-0.471, total=  21.3s\n",
      "[CV] max_depth=6, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=6, num_leaves=85, score=-0.480, total=  21.4s\n",
      "[CV] max_depth=7, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=60, score=-0.491, total=  26.5s\n",
      "[CV] max_depth=7, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=60, score=-0.492, total=  26.8s\n",
      "[CV] max_depth=7, num_leaves=60 ......................................\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ......... max_depth=7, num_leaves=60, score=-0.482, total=  31.3s\n",
      "[CV] max_depth=7, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=60, score=-0.468, total=  28.9s\n",
      "[CV] max_depth=7, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=60, score=-0.480, total=  27.0s\n",
      "[CV] max_depth=7, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=65, score=-0.485, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=65, score=-0.495, total=  27.1s\n",
      "[CV] max_depth=7, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=65, score=-0.483, total=  26.6s\n",
      "[CV] max_depth=7, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=65, score=-0.470, total=  26.8s\n",
      "[CV] max_depth=7, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=65, score=-0.477, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=70, score=-0.493, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=70, score=-0.489, total=  27.1s\n",
      "[CV] max_depth=7, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=70, score=-0.482, total=  26.7s\n",
      "[CV] max_depth=7, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=70, score=-0.471, total=  26.7s\n",
      "[CV] max_depth=7, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=70, score=-0.480, total=  27.2s\n",
      "[CV] max_depth=7, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=75, score=-0.493, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=75, score=-0.488, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=75, score=-0.480, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=75, score=-0.467, total=  26.8s\n",
      "[CV] max_depth=7, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=75, score=-0.479, total=  27.2s\n",
      "[CV] max_depth=7, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=80, score=-0.487, total=  28.4s\n",
      "[CV] max_depth=7, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=80, score=-0.491, total=  29.7s\n",
      "[CV] max_depth=7, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=80, score=-0.480, total=  29.9s\n",
      "[CV] max_depth=7, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=80, score=-0.471, total=  27.1s\n",
      "[CV] max_depth=7, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=80, score=-0.476, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=85, score=-0.488, total=  27.0s\n",
      "[CV] max_depth=7, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=85, score=-0.491, total=  27.2s\n",
      "[CV] max_depth=7, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=85, score=-0.479, total=  26.9s\n",
      "[CV] max_depth=7, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=85, score=-0.470, total=  27.0s\n",
      "[CV] max_depth=7, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=7, num_leaves=85, score=-0.479, total=  27.0s\n",
      "[CV] max_depth=8, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=60, score=-0.492, total=  32.2s\n",
      "[CV] max_depth=8, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=60, score=-0.492, total=  31.7s\n",
      "[CV] max_depth=8, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=60, score=-0.483, total=  31.7s\n",
      "[CV] max_depth=8, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=60, score=-0.469, total=  31.4s\n",
      "[CV] max_depth=8, num_leaves=60 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=60, score=-0.478, total=  31.6s\n",
      "[CV] max_depth=8, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=65, score=-3.992, total=  24.4s\n",
      "[CV] max_depth=8, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=65, score=-0.495, total=  32.5s\n",
      "[CV] max_depth=8, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=65, score=-0.480, total=  32.1s\n",
      "[CV] max_depth=8, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=65, score=-0.469, total=  32.2s\n",
      "[CV] max_depth=8, num_leaves=65 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=65, score=-0.483, total=  39.6s\n",
      "[CV] max_depth=8, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=70, score=-0.497, total=  34.9s\n",
      "[CV] max_depth=8, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=70, score=-0.493, total=  34.5s\n",
      "[CV] max_depth=8, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=70, score=-0.481, total=  33.1s\n",
      "[CV] max_depth=8, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=70, score=-0.472, total=  32.9s\n",
      "[CV] max_depth=8, num_leaves=70 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=70, score=-0.483, total=  32.8s\n",
      "[CV] max_depth=8, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=75, score=-0.492, total=  33.5s\n",
      "[CV] max_depth=8, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=75, score=-0.496, total=  33.0s\n",
      "[CV] max_depth=8, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=75, score=-0.481, total=  32.9s\n",
      "[CV] max_depth=8, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=75, score=-0.473, total=  33.2s\n",
      "[CV] max_depth=8, num_leaves=75 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=75, score=-0.481, total=  33.8s\n",
      "[CV] max_depth=8, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=80, score=-0.492, total=  32.8s\n",
      "[CV] max_depth=8, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=80, score=-0.493, total=  34.8s\n",
      "[CV] max_depth=8, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=80, score=-0.482, total=  34.0s\n",
      "[CV] max_depth=8, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=80, score=-0.470, total=  33.1s\n",
      "[CV] max_depth=8, num_leaves=80 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=80, score=-0.483, total=  34.5s\n",
      "[CV] max_depth=8, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=85, score=-0.494, total=  35.3s\n",
      "[CV] max_depth=8, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=85, score=-0.493, total=  33.9s\n",
      "[CV] max_depth=8, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=85, score=-0.480, total=  33.4s\n",
      "[CV] max_depth=8, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=85, score=-0.472, total=  33.3s\n",
      "[CV] max_depth=8, num_leaves=85 ......................................\n",
      "[CV] ......... max_depth=8, num_leaves=85, score=-0.479, total=  33.1s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed: 50.0min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(boosting_type='goss', class_weight=None,\n",
       "                                      colsample_bytree=1.0,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_depth=7,\n",
       "                                      min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=300,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=31,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=1.0,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'max_depth': [5, 6, 7, 8],\n",
       "                         'num_leaves': [60, 65, 70, 75, 80, 85]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Parameters to tune\n",
    "params = {'boosting_type': 'goss',  # Gradient-based One-Side Sampling\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          #'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':300,\n",
    "          #'max_bin': 127, \n",
    "          #'subsample': 0.7,\n",
    "          #'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "          'num_class': 9  # multiclass problem: number of target classes\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(num_leaves=[60, 65, 70, 75, 80, 85], max_depth=[5, 6, 7, 8])\n",
    "gc = GridSearchCV(lgc, tuned_params, cv=5, verbose=5, refit=False, scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'max_depth': 7, 'num_leaves': 80} -0.48129651656314265\n",
      "[-0.49249418 -0.49249418 -0.49249418 -0.49249418 -0.49249418 -0.49249418\n",
      " -0.48405927 -0.4829281  -0.4829281  -0.4829281  -0.4829281  -0.4829281\n",
      " -0.48282237 -0.48187244 -0.48301588 -0.48159329 -0.48129652 -0.48143437\n",
      " -0.48296551 -1.18361783 -0.4851154  -0.48462657 -0.48396632 -0.48348835]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 25 candidates, totalling 125 fits\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.482, total=  20.3s\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   20.2s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.489, total=  18.7s\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   38.9s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.475, total=  18.7s\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   57.7s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.467, total=  18.6s\n",
      "[CV] colsample_bytree=0.5, subsample=0.5 .............................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.3min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.5, subsample=0.5, score=-0.477, total=  19.3s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.482, total=  19.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.489, total=  19.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.475, total=  19.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.467, total=  19.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.6, score=-0.477, total=  19.4s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.482, total=  19.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.489, total=  19.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.475, total=  19.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.467, total=  19.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7, score=-0.477, total=  19.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.482, total=  19.4s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.489, total=  19.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.475, total=  19.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.467, total=  19.0s\n",
      "[CV] colsample_bytree=0.5, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.7999999999999999, score=-0.477, total=  19.8s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.482, total=  19.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.489, total=  19.4s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.475, total=  19.2s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.467, total=  19.1s\n",
      "[CV] colsample_bytree=0.5, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.5, subsample=0.8999999999999999, score=-0.477, total=  19.3s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.489, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.487, total=  20.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.477, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.470, total=  20.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.5, score=-0.474, total=  20.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.489, total=  20.5s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.487, total=  20.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.477, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.470, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.6, score=-0.474, total=  20.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.489, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.487, total=  20.9s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.477, total=  20.5s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.470, total=  20.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7, score=-0.474, total=  22.4s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.489, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.487, total=  20.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.477, total=  20.6s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.470, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.7999999999999999, score=-0.474, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.489, total=  20.7s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.487, total=  20.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.477, total=  20.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.470, total=  20.8s\n",
      "[CV] colsample_bytree=0.6, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.6, subsample=0.8999999999999999, score=-0.474, total=  20.8s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.485, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.489, total=  22.6s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.479, total=  22.2s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.466, total=  22.1s\n",
      "[CV] colsample_bytree=0.7, subsample=0.5 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.5, score=-0.477, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.485, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.489, total=  22.5s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.479, total=  22.2s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.466, total=  22.2s\n",
      "[CV] colsample_bytree=0.7, subsample=0.6 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.6, score=-0.477, total=  22.6s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.485, total=  22.7s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.489, total=  22.4s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.479, total=  22.1s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.466, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7 .............................\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7, score=-0.477, total=  22.4s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.485, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.489, total=  22.7s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.479, total=  22.4s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.466, total=  22.2s\n",
      "[CV] colsample_bytree=0.7, subsample=0.7999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.7999999999999999, score=-0.477, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.485, total=  22.1s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.489, total=  22.4s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.479, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.466, total=  22.3s\n",
      "[CV] colsample_bytree=0.7, subsample=0.8999999999999999 ..............\n",
      "[CV]  colsample_bytree=0.7, subsample=0.8999999999999999, score=-0.477, total=  22.6s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.487, total=  24.4s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.487, total=  24.2s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.477, total=  24.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.469, total=  24.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.5, score=-0.477, total=  23.9s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.487, total=  24.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.487, total=  23.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.477, total=  23.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.469, total=  23.9s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.6, score=-0.477, total=  24.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.487, total=  23.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.487, total=  24.0s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.477, total=  23.6s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.469, total=  24.9s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7, score=-0.477, total=  23.9s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.487, total=  23.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.487, total=  23.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.477, total=  23.7s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.469, total=  23.6s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.7999999999999999, score=-0.477, total=  23.7s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.487, total=  23.8s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.487, total=  23.7s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.477, total=  23.7s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.469, total=  23.9s\n",
      "[CV] colsample_bytree=0.7999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.7999999999999999, subsample=0.8999999999999999, score=-0.477, total=  24.0s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.489, total=  25.2s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.487, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.479, total=  25.3s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.471, total=  25.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.5 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.5, score=-0.478, total=  25.6s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.489, total=  25.3s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.487, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.479, total=  25.3s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.471, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.6 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.6, score=-0.478, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.489, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.487, total=  25.8s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.479, total=  25.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.471, total=  25.7s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7 ..............\n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7, score=-0.478, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.489, total=  25.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.487, total=  25.6s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.479, total=  25.2s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.471, total=  25.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.7999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.7999999999999999, score=-0.478, total=  25.6s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.489, total=  25.3s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.487, total=  25.5s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.479, total=  25.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.471, total=  25.4s\n",
      "[CV] colsample_bytree=0.8999999999999999, subsample=0.8999999999999999 \n",
      "[CV]  colsample_bytree=0.8999999999999999, subsample=0.8999999999999999, score=-0.478, total=  25.7s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed: 46.6min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(boosting_type='goss', class_weight=None,\n",
       "                                      colsample_bytree=1.0,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_depth=7,\n",
       "                                      min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=300,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=1.0,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'colsample_bytree': array([0.5, 0.6, 0.7, 0.8, 0.9]),\n",
       "                         'subsample': array([0.5, 0.6, 0.7, 0.8, 0.9])},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Parameters to tune\n",
    "params = {'boosting_type': 'goss',  # Gradient-based One-Side Sampling\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':300,\n",
    "          #'max_bin': 127, \n",
    "          #'subsample': 0.7,\n",
    "          #'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "          'num_class': 9  # multiclass problem: number of target classes\n",
    "         }\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "# Note: bagging is off here (subsample_freq=0, boosting_type='goss'), so subsample\n",
    "# does not appear to change the score; only colsample_bytree does.\n",
    "tuned_params = dict(subsample=np.arange(0.5,1.0,0.1),colsample_bytree=np.arange(0.5,1.0,0.1))\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=5,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'colsample_bytree': 0.5, 'subsample': 0.5} -0.47784405514826683\n",
      "[-0.47784406 -0.47784406 -0.47784406 -0.47784406 -0.47784406 -0.47933052\n",
      " -0.47933052 -0.47933052 -0.47933052 -0.47933052 -0.47917989 -0.47917989\n",
      " -0.47917989 -0.47917989 -0.47917989 -0.47960629 -0.47960629 -0.47960629\n",
      " -0.47960629 -0.47960629 -0.48074157 -0.48074157 -0.48074157 -0.48074157\n",
      " -0.48074157]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 9 candidates, totalling 45 fits\n",
      "[CV] reg_alpha=0, reg_lambda=0 .......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .......... reg_alpha=0, reg_lambda=0, score=-0.483, total=  16.7s\n",
      "[CV] reg_alpha=0, reg_lambda=0 .......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.6s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .......... reg_alpha=0, reg_lambda=0, score=-0.487, total=  17.4s\n",
      "[CV] reg_alpha=0, reg_lambda=0 .......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   34.1s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .......... reg_alpha=0, reg_lambda=0, score=-0.474, total=  16.8s\n",
      "[CV] reg_alpha=0, reg_lambda=0 .......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   50.9s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .......... reg_alpha=0, reg_lambda=0, score=-0.463, total=  17.9s\n",
      "[CV] reg_alpha=0, reg_lambda=0 .......................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.1min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] .......... reg_alpha=0, reg_lambda=0, score=-0.474, total=  17.4s\n",
      "[CV] reg_alpha=0, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=0.5, score=-0.482, total=  17.6s\n",
      "[CV] reg_alpha=0, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=0.5, score=-0.486, total=  16.4s\n",
      "[CV] reg_alpha=0, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=0.5, score=-0.474, total=  16.2s\n",
      "[CV] reg_alpha=0, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=0.5, score=-0.466, total=  15.7s\n",
      "[CV] reg_alpha=0, reg_lambda=0.5 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=0.5, score=-0.474, total=  16.7s\n",
      "[CV] reg_alpha=0, reg_lambda=1.0 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=1.0, score=-0.481, total=  16.0s\n",
      "[CV] reg_alpha=0, reg_lambda=1.0 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=1.0, score=-0.486, total=  15.5s\n",
      "[CV] reg_alpha=0, reg_lambda=1.0 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=1.0, score=-0.471, total=  16.5s\n",
      "[CV] reg_alpha=0, reg_lambda=1.0 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=1.0, score=-0.465, total=  15.2s\n",
      "[CV] reg_alpha=0, reg_lambda=1.0 .....................................\n",
      "[CV] ........ reg_alpha=0, reg_lambda=1.0, score=-0.473, total=  17.7s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=0, score=-0.479, total=  17.8s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=0, score=-0.484, total=  15.3s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=0, score=-0.469, total=  15.0s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=0, score=-0.462, total=  15.0s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=0.5, reg_lambda=0, score=-0.473, total=  15.3s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.479, total=  15.2s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.482, total=  15.6s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.473, total=  15.5s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.462, total=  15.3s\n",
      "[CV] reg_alpha=0.5, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=0.5, score=-0.472, total=  15.0s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=1.0, score=-0.480, total=  15.1s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=1.0, score=-0.484, total=  15.1s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=1.0, score=-0.469, total=  15.2s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=1.0, score=-0.460, total=  15.6s\n",
      "[CV] reg_alpha=0.5, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=0.5, reg_lambda=1.0, score=-0.472, total=  15.1s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=1.0, reg_lambda=0, score=-0.480, total=  15.5s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=1.0, reg_lambda=0, score=-0.482, total=  15.6s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=1.0, reg_lambda=0, score=-0.472, total=  15.7s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=1.0, reg_lambda=0, score=-0.459, total=  15.9s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0 .....................................\n",
      "[CV] ........ reg_alpha=1.0, reg_lambda=0, score=-0.469, total=  16.0s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=0.5, score=-0.481, total=  15.0s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=0.5, score=-0.480, total=  15.1s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=0.5, score=-0.472, total=  14.9s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=0.5, score=-0.463, total=  15.1s\n",
      "[CV] reg_alpha=1.0, reg_lambda=0.5 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=0.5, score=-0.470, total=  15.3s\n",
      "[CV] reg_alpha=1.0, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=1.0, score=-0.479, total=  15.0s\n",
      "[CV] reg_alpha=1.0, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=1.0, score=-0.479, total=  16.0s\n",
      "[CV] reg_alpha=1.0, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=1.0, score=-0.470, total=  16.1s\n",
      "[CV] reg_alpha=1.0, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=1.0, score=-0.459, total=  17.5s\n",
      "[CV] reg_alpha=1.0, reg_lambda=1.0 ...................................\n",
      "[CV] ...... reg_alpha=1.0, reg_lambda=1.0, score=-0.470, total=  16.3s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed: 11.9min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(boosting_type='goss', class_weight=None,\n",
       "                                      colsample_bytree=0.5,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=300,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.5,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'reg_alpha': [0, 0.5, 1.0],\n",
       "                         'reg_lambda': [0, 0.5, 1.0]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Tune the regularization parameters\n",
    "params = {'boosting_type': 'goss',  # Gradient-based One-Side Sampling\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':300,\n",
    "          'max_bin': 127, \n",
    "          'subsample': 0.5,\n",
    "          #'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'num_class': 9  # multiclass problem: number of target classes\n",
    "         }\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(reg_alpha=[0,0.5,1.0], reg_lambda=[0,0.5,1.0])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=5,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'reg_alpha': 1.0, 'reg_lambda': 1.0} -0.4714646136865848\n",
      "[-0.47604271 -0.47638195 -0.47514158 -0.47343752 -0.47367839 -0.47303259\n",
      " -0.47268021 -0.47330783 -0.47146461]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scores look slightly better whenever the L2 regularization parameter (reg_lambda) is 1, so L1 regularization is left out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 4 candidates, totalling 20 fits\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.481, total=  16.8s\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.7s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.484, total=  16.4s\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   33.1s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.472, total=  17.5s\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   50.7s remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.461, total=  16.2s\n",
      "[CV] max_bin=150 .....................................................\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.1min remaining:    0.0s\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CV] ........................ max_bin=150, score=-0.471, total=  17.9s\n",
      "[CV] max_bin=180 .....................................................\n",
      "[CV] ........................ max_bin=180, score=-0.482, total=  17.2s\n",
      "[CV] max_bin=180 .....................................................\n",
      "[CV] ........................ max_bin=180, score=-0.482, total=  17.9s\n",
      "[CV] max_bin=180 .....................................................\n",
      "[CV] ........................ max_bin=180, score=-0.473, total=  17.8s\n",
      "[CV] max_bin=180 .....................................................\n",
      "[CV] ........................ max_bin=180, score=-0.462, total=  18.0s\n",
      "[CV] max_bin=180 .....................................................\n",
      "[CV] ........................ max_bin=180, score=-0.471, total=  18.2s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.482, total=  18.1s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.483, total=  17.9s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.477, total=  17.6s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.464, total=  18.1s\n",
      "[CV] max_bin=200 .....................................................\n",
      "[CV] ........................ max_bin=200, score=-0.470, total=  17.3s\n",
      "[CV] max_bin=255 .....................................................\n",
      "[CV] ........................ max_bin=255, score=-0.481, total=  18.0s\n",
      "[CV] max_bin=255 .....................................................\n",
      "[CV] ........................ max_bin=255, score=-0.483, total=  18.0s\n",
      "[CV] max_bin=255 .....................................................\n",
      "[CV] ........................ max_bin=255, score=-0.473, total=  20.3s\n",
      "[CV] max_bin=255 .....................................................\n",
      "[CV] ........................ max_bin=255, score=-0.461, total=  21.0s\n",
      "[CV] max_bin=255 .....................................................\n",
      "[CV] ........................ max_bin=255, score=-0.472, total=  22.2s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  6.0min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "             estimator=LGBMClassifier(boosting_type='goss', class_weight=None,\n",
       "                                      colsample_bytree=0.5,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_depth=7,\n",
       "                                      min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=300,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=80,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=1, silent=False,\n",
       "                                      subsample=0.5, subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'max_bin': [150, 180, 200, 255]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Tune max_bin with reg_lambda fixed at 1\n",
    "params = {'boosting_type': 'goss',  # Gradient-based One-Side Sampling\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':300,\n",
    "          #'max_bin': 127, \n",
    "          'subsample': 0.5,\n",
    "          #'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'num_class':9,\n",
    "          'reg_lambda':1\n",
    "         }\n",
    "\n",
    "lgc = LGBMClassifier(silent=False, **params)  # ** unpacks the dict into keyword arguments\n",
    "tuned_params = dict(max_bin=[150,180,200,255])\n",
    "gc = GridSearchCV(lgc,tuned_params,cv=5,verbose=5,refit=False,scoring=\"neg_log_loss\")\n",
    "gc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'max_bin': 150} -0.4737521084226448\n",
      "[-0.47375211 -0.47392072 -0.47528095 -0.47405732]\n"
     ]
    }
   ],
   "source": [
    "print(gc.best_params_,gc.best_score_)\n",
    "print(gc.cv_results_['mean_test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The scores got worse, so keep using the previous regularization parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[10]\tcv_agg's multi_logloss: 1.52793 + 0.00202187\n",
      "[20]\tcv_agg's multi_logloss: 1.28673 + 0.00345897\n",
      "[30]\tcv_agg's multi_logloss: 1.1221 + 0.00405972\n",
      "[40]\tcv_agg's multi_logloss: 1.00451 + 0.00448379\n",
      "[50]\tcv_agg's multi_logloss: 0.915739 + 0.00473909\n",
      "[60]\tcv_agg's multi_logloss: 0.84682 + 0.00495259\n",
      "[70]\tcv_agg's multi_logloss: 0.79142 + 0.0053963\n",
      "[80]\tcv_agg's multi_logloss: 0.74767 + 0.00558495\n",
      "[90]\tcv_agg's multi_logloss: 0.712274 + 0.00570647\n",
      "[100]\tcv_agg's multi_logloss: 0.683331 + 0.0057814\n",
      "[110]\tcv_agg's multi_logloss: 0.659299 + 0.00594942\n",
      "[120]\tcv_agg's multi_logloss: 0.639773 + 0.00592318\n",
      "[130]\tcv_agg's multi_logloss: 0.622921 + 0.00608325\n",
      "[140]\tcv_agg's multi_logloss: 0.608784 + 0.00620008\n",
      "[150]\tcv_agg's multi_logloss: 0.596747 + 0.00628475\n",
      "[160]\tcv_agg's multi_logloss: 0.586398 + 0.00627849\n",
      "[170]\tcv_agg's multi_logloss: 0.577319 + 0.00632684\n",
      "[180]\tcv_agg's multi_logloss: 0.569263 + 0.00646747\n",
      "[190]\tcv_agg's multi_logloss: 0.562116 + 0.00644425\n",
      "[200]\tcv_agg's multi_logloss: 0.555921 + 0.00644379\n",
      "[210]\tcv_agg's multi_logloss: 0.550295 + 0.00636878\n",
      "[220]\tcv_agg's multi_logloss: 0.54511 + 0.0062912\n",
      "[230]\tcv_agg's multi_logloss: 0.54048 + 0.00622439\n",
      "[240]\tcv_agg's multi_logloss: 0.536183 + 0.00619732\n",
      "[250]\tcv_agg's multi_logloss: 0.532273 + 0.00619714\n",
      "[260]\tcv_agg's multi_logloss: 0.528696 + 0.00636095\n",
      "[270]\tcv_agg's multi_logloss: 0.525551 + 0.0063506\n",
      "[280]\tcv_agg's multi_logloss: 0.522509 + 0.00627955\n",
      "[290]\tcv_agg's multi_logloss: 0.519615 + 0.00619871\n",
      "[300]\tcv_agg's multi_logloss: 0.517 + 0.0062408\n",
      "[310]\tcv_agg's multi_logloss: 0.514588 + 0.00635018\n",
      "[320]\tcv_agg's multi_logloss: 0.512194 + 0.00638675\n",
      "[330]\tcv_agg's multi_logloss: 0.510041 + 0.00642604\n",
      "[340]\tcv_agg's multi_logloss: 0.508081 + 0.00644639\n",
      "[350]\tcv_agg's multi_logloss: 0.506126 + 0.00647987\n",
      "[360]\tcv_agg's multi_logloss: 0.504221 + 0.0065433\n",
      "[370]\tcv_agg's multi_logloss: 0.502569 + 0.00658662\n",
      "[380]\tcv_agg's multi_logloss: 0.500802 + 0.00658655\n",
      "[390]\tcv_agg's multi_logloss: 0.499225 + 0.00661213\n",
      "[400]\tcv_agg's multi_logloss: 0.497612 + 0.00661722\n",
      "[410]\tcv_agg's multi_logloss: 0.496198 + 0.0066022\n",
      "[420]\tcv_agg's multi_logloss: 0.49477 + 0.00666926\n",
      "[430]\tcv_agg's multi_logloss: 0.493441 + 0.00670295\n",
      "[440]\tcv_agg's multi_logloss: 0.492174 + 0.00675775\n",
      "[450]\tcv_agg's multi_logloss: 0.490903 + 0.00666666\n",
      "[460]\tcv_agg's multi_logloss: 0.489655 + 0.00665925\n",
      "[470]\tcv_agg's multi_logloss: 0.488394 + 0.00669101\n",
      "[480]\tcv_agg's multi_logloss: 0.487189 + 0.00669692\n",
      "[490]\tcv_agg's multi_logloss: 0.486111 + 0.00664744\n",
      "[500]\tcv_agg's multi_logloss: 0.485085 + 0.00667189\n",
      "[510]\tcv_agg's multi_logloss: 0.484157 + 0.00662433\n",
      "[520]\tcv_agg's multi_logloss: 0.483099 + 0.00665152\n",
      "[530]\tcv_agg's multi_logloss: 0.482117 + 0.00656035\n",
      "[540]\tcv_agg's multi_logloss: 0.481218 + 0.00656129\n",
      "[550]\tcv_agg's multi_logloss: 0.480334 + 0.00656018\n",
      "[560]\tcv_agg's multi_logloss: 0.479478 + 0.00661083\n",
      "[570]\tcv_agg's multi_logloss: 0.478637 + 0.00660802\n",
      "[580]\tcv_agg's multi_logloss: 0.477866 + 0.00658376\n",
      "[590]\tcv_agg's multi_logloss: 0.477101 + 0.00659011\n",
      "[600]\tcv_agg's multi_logloss: 0.476348 + 0.00654855\n",
      "[610]\tcv_agg's multi_logloss: 0.47557 + 0.00657116\n",
      "[620]\tcv_agg's multi_logloss: 0.474883 + 0.00656351\n",
      "[630]\tcv_agg's multi_logloss: 0.474162 + 0.00656978\n",
      "[640]\tcv_agg's multi_logloss: 0.473535 + 0.00660965\n",
      "[650]\tcv_agg's multi_logloss: 0.472895 + 0.00664147\n",
      "[660]\tcv_agg's multi_logloss: 0.472262 + 0.00664954\n",
      "[670]\tcv_agg's multi_logloss: 0.471609 + 0.00663561\n",
      "[680]\tcv_agg's multi_logloss: 0.471066 + 0.0066973\n",
      "[690]\tcv_agg's multi_logloss: 0.470469 + 0.00674239\n",
      "[700]\tcv_agg's multi_logloss: 0.46993 + 0.00676947\n",
      "[710]\tcv_agg's multi_logloss: 0.469426 + 0.00675856\n",
      "[720]\tcv_agg's multi_logloss: 0.468814 + 0.00677418\n",
      "[730]\tcv_agg's multi_logloss: 0.468339 + 0.0068342\n",
      "[740]\tcv_agg's multi_logloss: 0.467913 + 0.00686065\n",
      "[750]\tcv_agg's multi_logloss: 0.46746 + 0.00688482\n",
      "[760]\tcv_agg's multi_logloss: 0.466968 + 0.00689346\n",
      "[770]\tcv_agg's multi_logloss: 0.466491 + 0.00687842\n",
      "[780]\tcv_agg's multi_logloss: 0.466054 + 0.00693586\n",
      "[790]\tcv_agg's multi_logloss: 0.465565 + 0.00699023\n",
      "[800]\tcv_agg's multi_logloss: 0.465134 + 0.00704225\n",
      "[810]\tcv_agg's multi_logloss: 0.464736 + 0.00704877\n",
      "[820]\tcv_agg's multi_logloss: 0.464305 + 0.00704006\n",
      "[830]\tcv_agg's multi_logloss: 0.463869 + 0.00706762\n",
      "[840]\tcv_agg's multi_logloss: 0.463525 + 0.00705059\n",
      "[850]\tcv_agg's multi_logloss: 0.463198 + 0.00704694\n",
      "[860]\tcv_agg's multi_logloss: 0.462818 + 0.00707737\n",
      "[870]\tcv_agg's multi_logloss: 0.462485 + 0.00716313\n",
      "[880]\tcv_agg's multi_logloss: 0.462157 + 0.00719917\n",
      "[890]\tcv_agg's multi_logloss: 0.461797 + 0.00720509\n",
      "[900]\tcv_agg's multi_logloss: 0.461494 + 0.00717246\n",
      "[910]\tcv_agg's multi_logloss: 0.461205 + 0.00722922\n",
      "[920]\tcv_agg's multi_logloss: 0.460841 + 0.0072519\n",
      "[930]\tcv_agg's multi_logloss: 0.460535 + 0.0072231\n",
      "[940]\tcv_agg's multi_logloss: 0.460244 + 0.00725982\n",
      "[950]\tcv_agg's multi_logloss: 0.460011 + 0.00728418\n",
      "[960]\tcv_agg's multi_logloss: 0.459795 + 0.00728824\n",
      "[970]\tcv_agg's multi_logloss: 0.459546 + 0.00731106\n",
      "[980]\tcv_agg's multi_logloss: 0.459273 + 0.00733329\n",
      "[990]\tcv_agg's multi_logloss: 0.459018 + 0.00734619\n",
      "[1000]\tcv_agg's multi_logloss: 0.458796 + 0.00734225\n",
      "[1010]\tcv_agg's multi_logloss: 0.45857 + 0.00735831\n",
      "[1020]\tcv_agg's multi_logloss: 0.458346 + 0.00736974\n",
      "[1030]\tcv_agg's multi_logloss: 0.458071 + 0.00743308\n",
      "[1040]\tcv_agg's multi_logloss: 0.457889 + 0.0074205\n",
      "[1050]\tcv_agg's multi_logloss: 0.457641 + 0.00745499\n",
      "[1060]\tcv_agg's multi_logloss: 0.457458 + 0.00745772\n",
      "[1070]\tcv_agg's multi_logloss: 0.457263 + 0.00744555\n",
      "[1080]\tcv_agg's multi_logloss: 0.45708 + 0.00744845\n",
      "[1090]\tcv_agg's multi_logloss: 0.456862 + 0.00742671\n",
      "[1100]\tcv_agg's multi_logloss: 0.456671 + 0.0074689\n",
      "[1110]\tcv_agg's multi_logloss: 0.45651 + 0.0074565\n",
      "[1120]\tcv_agg's multi_logloss: 0.456368 + 0.00747383\n",
      "[1130]\tcv_agg's multi_logloss: 0.456211 + 0.0075054\n",
      "[1140]\tcv_agg's multi_logloss: 0.455974 + 0.0075265\n",
      "[1150]\tcv_agg's multi_logloss: 0.455804 + 0.00752419\n",
      "[1160]\tcv_agg's multi_logloss: 0.455722 + 0.00756159\n",
      "[1170]\tcv_agg's multi_logloss: 0.455527 + 0.0076431\n",
      "[1180]\tcv_agg's multi_logloss: 0.455429 + 0.00767035\n",
      "[1190]\tcv_agg's multi_logloss: 0.455369 + 0.00770644\n",
      "[1200]\tcv_agg's multi_logloss: 0.45527 + 0.00772594\n",
      "[1210]\tcv_agg's multi_logloss: 0.455211 + 0.00773452\n",
      "[1220]\tcv_agg's multi_logloss: 0.455091 + 0.0077559\n",
      "[1230]\tcv_agg's multi_logloss: 0.454974 + 0.00772773\n",
      "[1240]\tcv_agg's multi_logloss: 0.454834 + 0.00771328\n",
      "[1250]\tcv_agg's multi_logloss: 0.454811 + 0.00775075\n",
      "[1260]\tcv_agg's multi_logloss: 0.454734 + 0.00776654\n",
      "[1270]\tcv_agg's multi_logloss: 0.454614 + 0.0077897\n",
      "[1280]\tcv_agg's multi_logloss: 0.454555 + 0.00784853\n",
      "[1290]\tcv_agg's multi_logloss: 0.454515 + 0.00784565\n",
      "[1300]\tcv_agg's multi_logloss: 0.454457 + 0.00782495\n",
      "[1310]\tcv_agg's multi_logloss: 0.454369 + 0.00785384\n",
      "[1320]\tcv_agg's multi_logloss: 0.454241 + 0.00790828\n",
      "[1330]\tcv_agg's multi_logloss: 0.454177 + 0.00797469\n",
      "[1340]\tcv_agg's multi_logloss: 0.454165 + 0.00804088\n",
      "[1350]\tcv_agg's multi_logloss: 0.454129 + 0.00807524\n",
      "[1360]\tcv_agg's multi_logloss: 0.454056 + 0.00820595\n",
      "[1370]\tcv_agg's multi_logloss: 0.454009 + 0.00826905\n",
      "[1380]\tcv_agg's multi_logloss: 0.453952 + 0.00829944\n",
      "[1390]\tcv_agg's multi_logloss: 0.453941 + 0.00835544\n",
      "[1400]\tcv_agg's multi_logloss: 0.453867 + 0.00839294\n",
      "[1410]\tcv_agg's multi_logloss: 0.453786 + 0.00838528\n",
      "[1420]\tcv_agg's multi_logloss: 0.45385 + 0.00842762\n",
      "[1430]\tcv_agg's multi_logloss: 0.453741 + 0.00844923\n",
      "[1440]\tcv_agg's multi_logloss: 0.453672 + 0.00853078\n",
      "[1450]\tcv_agg's multi_logloss: 0.45362 + 0.00855902\n",
      "[1460]\tcv_agg's multi_logloss: 0.453625 + 0.00860746\n"
     ]
    }
   ],
   "source": [
    "#降低学习率，增大交叉验证次数，重新搜索n_estimators\n",
    "params = {'boosting_type': 'goss',#Gradient-based One-Side Sampling (基于梯度的单侧采样)\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.03,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          #'n_estimators':300,\n",
    "          'max_bin': 127, \n",
    "          'subsample': 0.5,\n",
    "          #'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'num_class':9,\n",
    "          'reg_lambda':1,\n",
    "          'reg_alpha':1\n",
    "         }\n",
    "cv_result=lgbm.cv(params,lgbm_data,nfold=10,metrics='multi_logloss',num_boost_round=10000,early_stopping_rounds=20,seed=3,verbose_eval=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4536061174729145\n",
      "1448\n"
     ]
    }
   ],
   "source": [
    "print(min(cv_result['multi_logloss-mean']))\n",
    "print(len(cv_result['multi_logloss-mean']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基本确定参数 提交结果 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LGBMClassifier(boosting_type='goss', class_weight=None, colsample_bytree=0.5,\n",
       "               importance_type='split', learning_rate=0.03, max_bin=127,\n",
       "               max_depth=7, min_child_samples=20, min_child_weight=0.001,\n",
       "               min_split_gain=0.0, n_estimators=1448, n_jobs=4, num_class=9,\n",
       "               num_leaves=80, objective='multiclass', random_state=None,\n",
       "               reg_alpha=1, reg_lambda=1, silent=False, subsample=0.5,\n",
       "               subsample_for_bin=200000, subsample_freq=0)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'goss',#Gradient-based One-Side Sampling (基于梯度的单侧采样)\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.03,\n",
    "          'num_leaves': 80,\n",
    "          'max_depth': 7,\n",
    "          'n_estimators':1448,\n",
    "          'max_bin': 127, \n",
    "          'subsample': 0.5,\n",
    "          #'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.5,\n",
    "          'num_class':9,\n",
    "          'reg_lambda':1,\n",
    "          'reg_alpha':1\n",
    "         }\n",
    "lgc=LGBMClassifier(silent=False,**params)\n",
    "lgc.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 保存模型\n",
    "import pickle as pc\n",
    "pc.dump(lgc,open('lgbm_goss_model.pkl','wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>feat_1</th>\n",
       "      <th>feat_2</th>\n",
       "      <th>feat_3</th>\n",
       "      <th>feat_4</th>\n",
       "      <th>feat_5</th>\n",
       "      <th>feat_6</th>\n",
       "      <th>feat_7</th>\n",
       "      <th>feat_8</th>\n",
       "      <th>feat_9</th>\n",
       "      <th>...</th>\n",
       "      <th>feat_84</th>\n",
       "      <th>feat_85</th>\n",
       "      <th>feat_86</th>\n",
       "      <th>feat_87</th>\n",
       "      <th>feat_88</th>\n",
       "      <th>feat_89</th>\n",
       "      <th>feat_90</th>\n",
       "      <th>feat_91</th>\n",
       "      <th>feat_92</th>\n",
       "      <th>feat_93</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>14</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 94 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id  feat_1  feat_2  feat_3  feat_4  feat_5  feat_6  feat_7  feat_8  feat_9  \\\n",
       "0   1       0       0       0       0       0       0       0       0       0   \n",
       "1   2       2       2      14      16       0       0       0       0       0   \n",
       "2   3       0       1      12       1       0       0       0       0       0   \n",
       "3   4       0       0       0       1       0       0       0       0       0   \n",
       "4   5       1       0       0       1       0       0       1       2       0   \n",
       "\n",
       "    ...     feat_84  feat_85  feat_86  feat_87  feat_88  feat_89  feat_90  \\\n",
       "0   ...           0        0       11        1       20        0        0   \n",
       "1   ...           0        0        0        0        0        4        0   \n",
       "2   ...           0        0        0        0        2        0        0   \n",
       "3   ...           0        3        1        0        0        0        0   \n",
       "4   ...           0        0        0        0        0        0        0   \n",
       "\n",
       "   feat_91  feat_92  feat_93  \n",
       "0        0        0        0  \n",
       "1        0        2        0  \n",
       "2        0        0        1  \n",
       "3        0        0        0  \n",
       "4        9        0        0  \n",
       "\n",
       "[5 rows x 94 columns]"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"D:\\\\python\\\\csdn data\\\\otto\\\\otto_test.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgc = pc.load(open(\"lgbm_goss_model.pkl\",\"rb\"))\n",
    "tfidf = pc.load(open('tfidf.pkl','rb'))\n",
    "mms_tfidf = pc.load(open('mms_tfidf.pkl','rb'))\n",
    "mms_log = pc.load(open('mms_log.pkl','rb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取index和columns\n",
    "index = df['id']\n",
    "data = df.drop(['id'],axis=1)\n",
    "columns_name = data.columns\n",
    "\n",
    "# 处理数据\n",
    "# transform counts to TFIDF features\n",
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "tfidf = TfidfTransformer()\n",
    "df_log = np.log1p(data)\n",
    "df_tfidf = tfidf.fit_transform(data).toarray()\n",
    "df_log_mms = mms_log.fit_transform(df_log)\n",
    "df_tfidf_mms = mms_tfidf.fit_transform(df_tfidf)\n",
    "\n",
    "# 拼接数据\n",
    "X = np.hstack((df_log_mms,df_tfidf_mms))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 预测数据\n",
    "result = lgc.predict_proba(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 生成提交数据\n",
    "col_name = ['Class_'+str(i) for i in range(1,10)]\n",
    "data = pd.DataFrame(data=result,columns=col_name)\n",
    "final = pd.concat([index,data],axis=1)\n",
    "# 保存结果\n",
    "final.to_csv('lgbm_goss_result.csv',index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
